The National Association of Scholars is an independent membership
association of academics and others working to sustain the tradition of
reasoned scholarship and civil debate in America’s colleges and universities.
We uphold the standards of a liberal arts education that fosters
intellectual freedom, searches for the truth, and promotes virtuous
citizenship.


What We Do


We publish a quarterly journal, Academic Questions, which examines the 
intellectual controversies and the institutional challenges of contemporary 
higher education. 

We publish studies of current higher education policy and practice
with the aim of drawing attention to weaknesses and stimulating
improvements.


NAS engages in public advocacy to pass legislation to advance the 
cause of higher education reform. We file friend-of-the-court briefs in 
legal cases defending freedom of speech and conscience and the civil 
rights of educators and students. We give testimony before congressio- 
nal and legislative committees and engage public support for worthy 
reforms. 

NAS holds national and regional meetings that focus on important
issues and public policy debates in higher education today.


Membership 

NAS membership is open to all who share a commitment to its core 
principles of fostering intellectual freedom and academic excellence in 
American higher education. A large majority of our members are current 
and former faculty members. We also welcome graduate and undergraduate
students, teachers, college administrators, and independent scholars,
as well as non-academic citizens who care about the future of higher
education.

NAS members receive a subscription to our journal Academic Questions 
and access to a network of people who share a commitment to academic 
freedom and excellence. We offer opportunities to influence key aspects 
of contemporary higher education. 

Visit our website, www.nas.org, to learn more about NAS and to become
a member.


An irreproducibility crisis afflicts a wide range of scientific and
social-scientific disciplines, from epidemiology to social
psychology. Improper research techniques, a lack of accountability,
disciplinary and political groupthink, and a scientific culture
biased toward producing positive results contribute to this plight. 
Other factors include inadequate or compromised peer review, secrecy, 
conflicts of interest, ideological commitments, and outright dishonesty. 

Science has always had a layer of untrustworthy results published 
in respectable places and “experts” who are eventually shown to 
have been sloppy, mistaken, or untruthful in their reported findings. 
Irreproducibility itself is nothing new. Science advances, in part, by 
learning how to discard false hypotheses, which sometimes means 
dismissing reported data that does not stand the test of independent 
reproduction. 

But the irreproducibility crisis is something new. The magnitude 
of false (or simply irreproducible) results reported as authoritative in 
journals of record appears to have dramatically increased. “Appears” is 
a word of caution, since we do not know with any precision how much 
unreliable reporting occurred in the sciences in previous eras. Today, 
given the vast scale of modern science, even if the percentage of unreliable
reports has remained fairly constant over the decades, the sheer
number of irreproducible studies has grown vastly. Moreover, the
contemporary practice of science, which depends on a regular flow of
large state expenditures, means that the public is, in effect, buying a
product rife with defects. On top of this, the regulatory state frequently 
builds both its cases for regulation and the substance of its regulations on 
the basis of unproven, unreliable, and sometimes false scientific claims. 

In short, many supposedly scientific results cannot be reproduced 
reliably in subsequent investigations and offer no trustworthy insight 
into the way the world works. A majority of modern research findings in 
many disciplines may well be wrong. 

That was how the National Association of Scholars summarized 
matters in our report The Irreproducibility Crisis of Modern Science: Causes, 
Consequences, and the Road to Reform (2018).¹ Since then we have continued
our work to press for reproducibility reform by several different avenues.
In February 2020, we co-sponsored with the Independent Institute
an interdisciplinary conference on Fixing Science: Practical Solutions
for the Irreproducibility Crisis, to publicize the irreproducibility crisis,
exchange information across disciplinary lines, and canvass (as the title
of the conference suggests) practical solutions for the irreproducibility
crisis.² We have also provided a series of public comments in support of
the Environmental Protection Agency’s rule Strengthening Transparency
in Pivotal Science Underlying Significant Regulatory Actions and Influential
Scientific Information.³ We have publicized different aspects of the
irreproducibility crisis by way of podcasts and short articles.⁴


1 David Randall and Christopher Welser, The Irreproducibility Crisis of Modern Science: Causes, Consequences,
and the Road to Reform (National Association of Scholars, 2018),
https://www.nas.org/reports/the-irreproducibility-crisis-of-modern-science.

2 Fixing Science: Practical Solutions for the Irreproducibility Crisis, YouTube,
https://www.youtube.com/watch?v=eee6KIoEUR4&list=PL-mariB2b6NugvvjAFeAjK-_-Y6wXCkvM;
“Conference Follow-up: Fixing Science,” National Association of Scholars, February 19, 2020,
https://www.nas.org/blogs/article/conference-follow-up-fixing-science.

3 “UPDATED: NAS Public Comment on Strengthening Transparency in Regulatory Science,” National
Association of Scholars, June 19, 2018,
https://www.nas.org/blogs/article/updated_nas_public_comment_on_strengthening_transparency_in_regulatory_scie;
Peter Wood, “NAS Comments on EPA’s Proposed Supplemental Notice of Proposed Rulemaking,” March 23, 2020,
https://www.nas.org/blogs/article/nas-comment-on-epas-proposed-supplemental-notice-of-proposed-rulemaking;
“Comments on EPA’s Final Rule, ‘Strengthening Transparency’,” National Association of Scholars, January 12, 2021,
https://www.nas.org/blogs/article/nas-comments-on-epas-final-rule-strengthening-transparency.

4 “Episode #51: Rabble Rousing with Lee Jussim,”
https://www.nas.org/blogs/media/episode-51-rabble-rousing-with-lee-jussim;
“Legally Wrong: When Courts and Science Meet with Nathan Schachtman,”
https://www.nas.org/blogs/media/legally-wrong-when-politics-and-science-meet-with-nathan-schactman;
David Randall, “Bad Science Makes for Bad Government,” National Association of Scholars, September 19, 2019,
https://www.nas.org/blogs/article/bad-science-makes-for-bad-government;
Edward Reid, “Irreproducibility and Climate Science,” National Association of Scholars, May 17, 2018,
https://www.nas.org/blogs/article/irreproducibility_and_climate_science.

And we have begun work on our Shifting Sands project. This report is
the first of four that will appear as part of Shifting Sands, each of which
will address the role of the irreproducibility crisis in different areas of
federal regulatory policy. Here we address a central question that arose
after we published The Irreproducibility Crisis.


You’ve shown that a great deal of science hasn’t been reproduced
properly and may well be irreproducible. How much
government regulation is actually built on irreproducible science?
What has been the actual effect on government policy
of irreproducible science? How much money has been wasted
to comply with regulations that were founded on science that
turned out to be junk?


This is the $64 trillion question. It is not easy to answer. Because
the irreproducibility crisis has so many components, each of which could
affect the research that is used to inform regulatory policy, we are faced 
with a maze of possible sources of misdirection. 


The authors of Shifting Sands include these just to begin with: 


malleable research plans;
legally inaccessible data sets;
opaque methodology and algorithms;
undocumented data cleansing;
inadequate or non-existent data archiving;
flawed statistical methods, including p-hacking;
publication bias that hides negative results; and
political or disciplinary groupthink.


Each of these could have far-reaching effects on government regulatory
policy, and for each of these, the critique, if well-argued, would
most likely prove that a given piece of research had not been reproduced
properly, not that it actually had failed to reproduce. (Studies can be
made to “reproduce,” even if they don’t really.) To answer the question
thoroughly, one would need to reproduce, multiple times, to modern
reproducibility standards, every piece of research that informs governmental
regulatory policy.

This should be done. But it is not within our means to do so. 

What the authors of Shifting Sands did instead was to reframe the 
question more narrowly. Governmental regulation is meant to clear a high 
barrier of proof. Regulations should be based on a very large body of
scientific research, the combined evidence of which provides sufficient
certainty to justify reducing Americans’ liberty with a government regulation.
What is at issue is not any particular piece of scientific research,
but rather whether the entire body of research provides so great a degree
of certainty as to justify regulation. If the government issues a regulation
based on a body of research that has been affected by the irreproducibility 
crisis so as to create the false impression of collective certainty (or extremely 
high probability), then, yes, the irreproducibility crisis has affected government 
policy by providing a spurious level of certainty to a body of research that justi- 
fies a government regulation. 

The justifiers of regulations based on flimsy or inadequate research 
often cite a version of what is known as the “precautionary principle.” 
This means that, rather than basing a regulation on science that has 
withstood rigorous tests of reproducibility, they base the regulation on 
the possibility that a scientific claim is accurate. They do this with the 
logic that it is too dangerous to wait for the actual validation of a hypoth- 
esis, and that a lower standard of reliability is necessary when dealing 
with matters that might involve severely adverse outcomes if no action 
is taken. 

This report does not deal with the precautionary principle, since it 
summons a conclusiveness that lies beyond the realm of actual science. 
We note, however, that invocation of the precautionary principle is not 
only non-scientific, but is also an inducement to accepting meretricious 
scientific practice and even fraud. 

The authors of Shifting Sands addressed the more narrowly framed
question posed above. They took a straightforward statistical test,
Multiple Testing and Multiple Modeling (MTMM), and applied it to a body
of meta-analyses used to justify government research. MTMM provides
a simple way to assess whether any body of research has been affected
by publication bias, p-hacking, and/or HARKing (Hypothesizing After
the Results were Known), all central components of the irreproducibility
crisis. In this first report, the authors applied this MTMM method
to portions of the research underlying the Environmental Protection
Agency’s (EPA) PM2.5 regulations: the regulations based upon research
affirming that particulate matter smaller than 2.5 microns in diameter
has a deleterious effect on human health. The authors found that
there was indeed strong evidence that these meta-analyses had been
affected by publication bias, p-hacking, and/or HARKing. Their result
provides strong evidence that elements of the irreproducibility crisis have led
the Environmental Protection Agency to impose burdensome regulations with
substantial economic impact based on insufficient scientific support.

That’s the headline conclusion. But it leads to further questions. Why 
didn’t the EPA use this statistical technique long ago? How exactly does 
regulatory policy assess scientific research? What precise policy reforms 
does this research conclusion therefore suggest? 

The broadest answer to why the EPA hasn’t adopted this statistical
technique for PM2.5 regulations is that the entire discipline of environmental
epidemiology depends upon a series of assumptions and procedures,
many of which give pause to professionals in different fields, and which
should give pause to the layman as well.


At the most fundamental statistical level, environmental
epidemiology has not taken into account the recent challenges
posed to the very concept of statistical significance, or
the procedures of probability of causation.⁵ The Shifting Sands
authors confined their critique to much narrower ground,
but readers should be aware that the statistical foundations
underlying environmental epidemiology are by no means
secure.

5 W.M. Briggs, “Everything wrong with p-values under one roof,” in Beyond Traditional Probabilistic Methods
in Economics, ECONVN 2019, Studies in Computational Intelligence, Volume 809, eds. Kreinovich V., Thach
N., Trung N., Van Thanh D. (Cham, Switzerland: Springer, 2019), https://doi.org/10.1007/978-3-030-04200-4_2;
Louis Anthony Cox, Jr., et al., Causal Analytics for Applied Risk Analysis (Cham, Switzerland: Springer,
2018).

Environmental epidemiology generally relies on statistical 
associations between air components and health outcomes, 
not on direct causal biological mechanisms. Statistical 
methods matter so much in the debate about regulatory 
policy because, usually, the only support for regulation lies 
in such statistical associations. 

Environmental epidemiology relies on unique data sets 
that are not publicly available. The nature of the discipline 
provides rationales for this procedure. Environmental 
epidemiology requires massive amounts of data collected 
over decades. It is difficult to collect this data even once— 
much of the data belongs to private organizations, and the 
data may pose a threat to the privacy of the individuals from
whom they were collected. Nevertheless, it is not in any 
strict sense science to rely on data which are not freely avail- 
able for inspection. 

Most relevantly for Shifting Sands, environmental epidemi- 
ology as a discipline has rejected the need to adjust results 
for multiple comparisons. In 1990, the lead editorial of 
Epidemiology bore the title, “No Adjustments Are Needed for 
Multiple Comparisons.”⁶ The entire discipline of environmental
epidemiology uses procedures that are guaranteed
to produce false positives and rejects using well-established
corrective procedures. MTMM tests have been available
for decades. Genetic epidemiologists adopted them long
ago. Environmental epidemiology rejects MTMM tests as a
discipline, and because it does, the EPA can say it is simply
following professional judgment.
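
To make the multiple-comparisons problem concrete, the following Python sketch computes how quickly the chance of at least one false positive grows when many tests are run, each at the conventional 0.05 threshold. It is an illustration only, not the MTMM procedure used in this report: it assumes the tests are independent and that no real effect exists, and it uses the Bonferroni correction merely as one familiar example of a corrective procedure. The function names are hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests when every null hypothesis is true (assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps the overall
    false-positive chance at or below `alpha` (Bonferroni correction)."""
    return alpha / num_tests


if __name__ == "__main__":
    for num_tests in (1, 10, 50, 100):
        rate = familywise_error_rate(num_tests)
        threshold = bonferroni_threshold(num_tests)
        print(f"{num_tests:>3} tests: chance of a false positive = {rate:.3f}, "
              f"Bonferroni per-test threshold = {threshold:.5f}")
```

Under these assumptions, ten uncorrected tests already carry roughly a 40% chance of producing at least one spurious "significant" result.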


6 K. J. Rothman, “No adjustments are needed for multiple comparisons,” Epidemiology 1, 1 (1990): 43-46.
https://www.jstor.org/stable/pdf/20065622.pdf?seq=1.

These are serious flaws, and I don’t mean by highlighting them to
suggest that environmental epidemiologists haven’t done serious and
successful work to keep themselves on the statistical straight-and-narrow.
A very large portion of environmental epidemiology consists of
sophisticated and successful attempts to ensure that practitioners avoid
the biased selection of data, and the discipline also has adopted several
procedures to account for aspects of the irreproducibility crisis.⁷ The
discipline does a great deal correctly, for which it should be commended.
But the discipline isn’t perfect. It possesses blind-spots that amount to 
disciplinary groupthink. Americans must not simply defer to environ- 
mental epidemiology’s “professional consensus.” 

Yet that is what the EPA does, and, indeed, the federal government as
a whole. The intention here was sensible: that government should seek to
base its views on disinterested experts as the best way to provide authoritative
information on which it should act. Yet there are several deep-rooted
flaws in this system, which have become increasingly apparent
in the decades since the government has developed an extensive
scientific-regulatory complex.


Government regulations do not account for disciplinary
groupthink.

Government regulations do not account for the possibil- 
ity that a group of scientists and governmental regulators, 
working unconsciously or consciously, might act to skew 
the consideration of which science should be used to inform 
regulation. 

Government regulations define “best available science” by
the “weight of evidence” standard. This is an arbitrary standard,
subject to conscious or unconscious manipulation by
government regulators. It facilitates the effects of groupthink
and the skewed consideration of evidence.


7 Scott M. Bartell, “Understanding and Mitigating the Replication Crisis, for Environmental Epidemiologists,”
Current Environmental Health Reports 6, 1 (2019): 8-15. https://doi.org/10.1007/s40572-019-0225-4.

Governmental regulations have failed to address fully the
challenge of the irreproducibility crisis, which requires a 
much higher standard of transparency and rigor than was 
previously considered “best acceptable science.” 

The entire framework of seeking out disinterested expertise
failed to take into account the inevitable effects of using
scientific research to justify regulations that affect policy,
have real-world effect, and become the subject of political
debate and action. The political consequences have unavoidably
had the effect of tempting political activists to skew
both scientific research and the governmental means of
weighing scientific research. Put another way, any formal
system of assessment inevitably invites attempts to game it.

To all this we may add the distorting effects of massive
government funding of scientific research. The United States
federal government is the largest single funder of scientific
research in the world; its expectations affect not only the
research it directly funds but also all research done in hopes
of receiving federal funding. Government experts therefore
have it in their power to create a skewed body of research,
which they can then use to justify regulation.


Shifting Sands casts a critical eye on the procedures of the field of envi- 
ronmental epidemiology, but it also casts a critical eye on governmental 
regulatory procedure, which has provided no check to the flaws of the 
environmental epidemiology discipline, and which is susceptible to great 
abuse. Shifting Sands is doing work that environmental epidemiologists 
and governmental regulators should have done decades ago. Their failure 
to do so is itself substantial evidence of the need for widespread reform, 
both among environmental epidemiologists and among governmental 
regulators. 

Before I go further, I should make clear the stakes of the “skew” in
science that feeds regulation.

A vast amount of government regulation is based on scientific
research affected by the irreproducibility crisis. This research includes
such salient topics as racial disparity, implicit bias, climate change, and
pollution regulation, along with every aspect of science and social science that
uses statistics. Climate change is the most fiercely debated subject, but
the EPA’s pollution regulations are a close second, not least because
American businesses must pay extraordinary amounts of money to
comply with them. A 2020 report prepared for the Natural Resources
Defense Council estimates that American air pollution regulations cost
$120 billion per year, and we may take the estimate provided to an
environmental advocacy group to be the lowest plausible number.⁸ The
economic consequences carry with them correspondingly weighty political
corollaries: the EPA’s pollution regulations constitute a large proportion
of the total power available to the federal government. The economic
and political consequences of the EPA’s regulations are why we devoted
our first Shifting Sands report to PM2.5 regulation.

PM2.5 regulation is not even the largest single issue the irreproducibility
crisis has raised with EPA pollution regulations. The largest single
issue is the Harvard Six Cities and American Cancer Society (ACS) stud- 
ies, which provide the basis for much of the EPA’s pollution regulation. 
All this data is confidential and not publicly available for full reproduc- 
tion. Any rigorous introduction of transparency requirements, applied 
retroactively, has the potential to disable much of the last generation of 
EPA pollution regulations. Any reproducibility reform has the potential to 
act as a precedent for extraordinarily consequential rollback of existing 
pollution regulations. 

These political consequences lie behind public arguments about
“skew.” Critics of the EPA point to the influence of environmental activists
on overlapping groups of scientists and regulators, who collectively
skew the results of science toward answers that would justify regulation.
Defenders of the EPA, by contrast, see industry-employed scientists who
skew science to undercut the case for regulation.⁹ The authors of Shifting
Sands are among the former camp, as am I. I find it difficult to comprehend
how the gap between environmental science and environmental
regulatory policy could have emerged absent such skew.

8 Jason Price, et al., The Benefits and Costs of U.S. Air Pollution Regulations (Industrial Economics, Incorporated,
2020), https://www.nrdc.org/sites/default/files/iec-benefits-costs-us-air-pollution-regulations-report.pdf.

Such arguments do not necessarily deny the good faith of those 
accused of skewing science. Humans are capable of good faith and bad 
faith at the same time: struggling for truth here, taking shortcuts there, 
and sometimes just knowingly advancing a falsehood on the presumption 
that the ends justify the means. The intrusion of bad faith is not the vice 
of only one party. Knowing that people are tempted, we need checks and 
balances, transparency, and something like an audit trail. Both conscious 
and unconscious bias play a part, as does sloppiness or deliberate use of 
bad scientific procedures to obtain preferred policy goals. The NAS would 
prefer to believe that the mistakes of the EPA derive from sloppiness and 
unconscious bias, and that a good-faith critique of its practices will be 
met with an equally good-faith response. 

Shifting Sands strengthens the case for policy reforms that would 
reduce the EPA’s current remit. The authors and I believe that this is the 
logical corollary of the current state of statistically informed science. I 
trust that we would favor the rigorous use of MTMM tests no matter what 
policy result they indicated, and I will endeavor to make good on that 
principle if MTMM tests emerge that argue against my preferred poli- 
cies. Those are the policy stakes of Shifting Sands. I hope that its scientific 
claims will be judged without reference to its likely policy consequences. 
The possible policy consequences have not pre-determined the report’s 
findings. We claim those findings are true, regardless of the conse- 
quences, and we invite others to reproduce our work. 

This report puts into layman’s language the results of several technical
studies by members of the Shifting Sands team of researchers, S.
Stanley Young and Warren Kindzierski. Some of these studies have been
accepted by peer-reviewed journals; others are under submission. As
part of NAS’s own institutional commitment to reproducibility, Young
and Kindzierski pre-registered the methods of their technical studies.
And, of course, NAS’s support for these researchers explicitly guaranteed
their scholarly autonomy and the expectation that these scholars
would publish freely, according to the demands of data, scientific rigor,
and conscience.

9 Erik M. Conway and Naomi Oreskes, Merchants of Doubt: How a Handful of Scientists Obscured the Truth on
Issues from Tobacco Smoke to Global Warming (New York: Bloomsbury Press, 2010).

This report is only the first of four scheduled reports, each critiqu- 
ing different aspects of the scientific foundations of federal regulatory 
policy. We intend to publish these reports separately and as one long 
report, which will eliminate some necessary duplication in the material 
of each individual report. The NAS intends these four reports collectively 
to provide a substantive, wide-ranging answer to the question What has 
been the actual effect on government policy of irreproducible science? 

I am deeply grateful for the support of many individuals who made
Shifting Sands possible. The Arthur N. Rupe Foundation provided Shifting
Sands’ funding, and, within the Rupe Foundation, Mark Henrie’s support
and good will got this project off the ground and kept it flying. Four
readers invested considerable time and thought to improve this report
with their comments: Anonymous, William M. Briggs, David C. Bryant,
and Louis Anthony Cox, Jr. David Randall, NAS’s Director of Research,
provided staff coordination of Shifting Sands, and, of course, Stanley
Young has served as Director of the Shifting Sands Project. Reports such
as these rely on a multitude of individual, extraordinary talents.


Introduction


Something Has Gone Wrong With Science


An irreproducibility crisis afflicts a wide range of scientific and
social-scientific disciplines, from public health to social
psychology. Far too frequently, scientists cannot replicate
claims made in published research.¹ Many improper scientific practices
contribute to this crisis, including poor applied statistical methodology,
bias in data reporting, fitting the hypotheses to the data, and endemic
groupthink.² Far too many scientists use improper scientific practices,
including outright fraud.³

The irreproducibility crisis affects entire scientific disciplines. In
2011, researchers at the National Institute of Statistical Sciences reported
that not one of fifty-two claims in a body of observational studies could
be replicated in randomized clinical trials.⁴ In 2012, the biotechnology
firm Amgen tried to reproduce 53 “landmark” scientific studies in hematology
and oncology; it could only replicate six.⁵ A 2015 Open Science
Collaboration study that analyzed 100 experimental claims published
in prominent psychological journals found that only 36% of the replication
research produced statistically significant results, versus 97% of the
original studies.⁶

This poses serious questions for policymakers. How many federal 
regulations reflect irreproducible, flawed, and unsound research? How 
many grant dollars have funded irreproducible research? In short, how
many government regulations based on irreproducible claims harm the
common good?

1 Sarewitz (2012).
2 Randall (2018).
3 Ritchie (2020).
4 Young (2011).
5 Begley (2012).
6 Open Science Collaboration (2015).

Professional groupthink among nutritionists led the Food and Drug
Administration (FDA) to recommend that Americans cut their intake of
fat, instead of sugar, to prevent obesity. The FDA’s guidelines were ineffective
or outright harmful: the American obesity rate skyrocketed from
13.4% to 35.1% between 1960-62 and 2005-06, and further increased to
45.8% by 2013-16.⁷

The Federal government spends millions of dollars to train its officials
to avoid “implicit bias.”⁸ The Department of Education cited the
same “implicit bias” to justify a Dear Colleague letter strong-arming
local school districts to loosen their school discipline policies.⁹ Yet when
researchers tried to replicate the sociology research claiming to prove
the existence of “implicit bias,” they couldn’t.¹⁰

The Nuclear Regulatory Commission adopted the Linear-No-Threshold
(LNT) dose-response model to justify extensive safety regulations
to prevent cancer risks. Yet increasing numbers of experimenters
have failed to reproduce the research that justified the LNT dose-response
model,¹¹ which has been used to support crippling regulations.¹²

These examples are only the tip of the iceberg. Even that tip suggests 
that the irreproducibility crisis in science may have inflicted massive 
damage on federal regulatory policy. 

Americans need to know just how bad that damage is, and which
reforms can best improve how government regulation assesses scientific
research.

7 Faruque (2019); Leslie (2016); Meach (2018); NCHS (2008); NDSR (2020).
8 E.g., DOJ (2016).
9 Lhamon (2016).
10 Blanton (2015); Carlsson (2016).
11 Sanders (2010); Sanders (2017).
12 Calabrese (2017); and see Young (2015); Obenchain (2017).


The Shifting Sands Project


The National Association of Scholars’ (NAS) project, Shifting Sands:
Unsound Science and Unsafe Regulation, examines how irreproducible
science negatively affects select areas of government policy and regulation
governed by different federal agencies. We also aim to demonstrate
procedures by which to detect irreproducible research. We believe
government agencies should incorporate these procedures as they determine
what constitutes “best available science,” the bureaucratically
defined standard that judges which research should inform government
regulation.¹³

This first policy paper on PM2.5 Regulation focuses on irreproducible
research in the field of environmental epidemiology that informs
the Environmental Protection Agency’s (EPA) policies and regulations.
PM2.5 Regulation specifically focuses upon the scientific research that
associates airborne fine particulate matter smaller than 2.5 microns in
diameter (PM2.5) with health effects such as asthma and heart attacks.
This research undergirds existing, economically burdensome EPA
air pollution regulations. Future reports will examine irreproducible
research that informs coronavirus policy at the Centers for Disease
Control and Prevention (CDC), nutrition policy at the Food and Drug
Administration’s (FDA) Center for Food Safety and Applied Nutrition, and
implicit bias policy at the Department of Education.

Shifting Sands aims to demonstrate that the irreproducibility crisis
has affected so broad a range of government regulation and policy that
government agencies should engage in thoroughgoing modernization
of the procedures by which they judge “best available science.” Agency
regulations should address all aspects of irreproducible research, including
the inability to reproduce:


the research processes of investigations;
the results of investigations; and
the interpretation of results.¹⁴


In Shifting Sands we will use a single analysis strategy for all of our
policy papers: p-value plotting (a form of Multiple Testing and Multiple
Modeling) as a way to demonstrate weaknesses in different agencies’ use


13 Kuhn (2016). Federal law mandates that various agencies use best available science, but leaves the concept at 
best vaguely defined. Each agency provides its own definition of best available science. We use best available 
science in this report to refer either to the definition provided by the Environmental Protection Agency or to 
the overall use by federal agencies of best available science. We believe each use is clear in context. 

14 NASEM (2016). 
of meta-analyses. Our common approach supports a comparative analysis
across different subject areas, while allowing for a focused examination
of one dimension of the impact of the irreproducibility crisis on
government agencies’ policies and regulations.

Future investigations into the effects of the irreproducibility crisis
on regulatory policy might explore (for example) the consequences of:


malleable research plans; 

legally inaccessible data sets; 

opaque methodology and algorithms; 

undocumented data cleansing; 

inadequate or non-existent data archiving; 

flawed statistical methods, including p-hacking (described 
below); 

publication bias that hides negative results; and 


political or disciplinary groupthink.¹⁶


Each of these effects can degrade the reliability of scientific research; 
jointly, they have greatly reduced public confidence in the reliability of 
scientific research that underpins federal regulatory policy. 

PM2.5 Regulation focuses on one subject matter—PM2.5 regulation—and
one methodology—p-value plotting—to critique meta-analyses. The
paper contains five sections:

1. an introduction to the nature of the irreproducibility crisis;

2. an explanation of p-value plotting;

3. a history of the EPA’s PM2.5 regulation;

4. the results of our examination of environmental epidemiology
meta-analyses; and

5. our recommendations for policy changes.


15 Cecil (1985). 
16 Randall (2018). 




Our policy recommendations include both specific technical recommendations
directly following from our technical analysis, and broader
policy recommendations to address the larger effects of the
irreproducibility crisis.
The Irreproducibility Crisis of Modern Science


The Catastrophic Failure of Scientific 
Replication 


Before plunging into the gory details, let us briefly review the
methods and procedures of science. The empirical scientist 
conducts controlled experiments and keeps accurate, unbi- 
ased records of all observable conditions at the time the experiment is 
conducted. If a researcher has discovered a genuinely new or previously
unobserved natural phenomenon, other researchers—with access to his 
notes and some apparatus of their own devising—will be able to repro- 
duce or confirm the discovery. If sufficient corroboration is forthcoming, 
the scientific community eventually acknowledges that the phenomenon 
is real and adapts existing theory to accommodate the new observations. 
The validation of scientific truth requires replication or reproduction. 
Replicability (most applicable to the laboratory sciences) most commonly 
refers to obtaining an experiment’s results in an independent study, by 
a different investigator with different data, while reproducibility (most 
applicable to the observational sciences) refers to different investiga- 
tors using the same data, methods, and/or computer code to reach the 
same conclusion.¹⁷ We may further subdivide reproducibility into methods
reproducibility, results reproducibility, and inferential reproducibility.¹⁸
Scientific knowledge only accrues as multiple independent investigators
replicate and reproduce one another’s work.¹⁹
Yet today the scientific process of replication and reproduction has
ceased to function properly. A vast proportion of the scientific claims in
published literature have not been replicated or reproduced; credible


17 NASEM (2016); NASEM (2019); Nosek (2020); Pellizzari (2017). 

18 Goodman (2016). 

19 We define reproducibility throughout our report as the testing and reproducing of an experiment’s underly- 
ing hypothesis using fresh data and/or a new method of analysis. Psychologists also conduct conceptual repli- 
cations, "the attempt to test the same theoretical process as an existing study, but that uses methods that 
vary in some way from the previous study” (Crandall 2016). The biomedical literature, however, does not refer 
to conceptual replication (NASEM 2016), and we have not innovated by using it in this report. We note the 
general importance and usefulness of conceptual replication, however, and we recommend that profession- 
als in other disciplines consider whether it can be adapted usefully for their own research procedures. 
estimates are that a majority of these claims cannot be replicated or
reproduced—that they are in fact false.²⁰ An extraordinary number of
scientific and social-scientific disciplines no longer reliably produce true
results—a state of affairs commonly referred to as the irreproducibility
crisis (reproducibility crisis, replication crisis). A substantial majority of
1,500 active scientists recently surveyed by Nature called the current
situation a crisis; 52% judged the situation a major crisis and another
38% judged it “only” a minor crisis.²¹ The increasingly degraded ordinary
procedures of modern science display the symptoms of catastrophic
failure.²²

The scientific world’s dysfunctional professional incentives bear
much of the blame for this catastrophic failure.


The Scientific World’s Professional 
Incentives 


Scientists generally think of themselves as pure truth-seekers who
seek to follow a scientific ethos roughly corresponding to Merton’s norms
of universalism, communality, disinterestedness, and organized
skepticism.²³ Public trust in scientists²⁴ generally derives from a belief that
they adhere successfully to those norms. But this self-conception differs
markedly from reality.

Knowingly or unknowingly, scientists respond to economic and reputational
incentives as they pursue their own self-interest.²⁵ Buchanan
and Tullock’s work on public choice theory provides a good general
framework. Politicians and civil servants (bureaucrats) act to maximize
their self-interest rather than acting as disinterested servants of the
public good.²⁶ This general insight applies specifically to scientists, peer


20 Halsey (2015); Ioannidis (2005); Randall (2018).

21 Baker (2016). 

22 Archer (2020); Chawla (2020); Coleman (2019); Engber (2017); Gobry (2016); Hennon (2019); Herold (2018); 
Ioannidis (2005); Manuel (2019); NASEM (2019); Randall (2018); Yong (2018); Young (2018a); Zeeman (1976);
Zimring (2019). 

23 Merton (1973); and see Anderson (2010); Kim (2018).

24 Sample (2019). 

25 Buchanan (2004); Edwards (2017); Freese (2018); Glaeser (2006); and see Keller (2015); Shapin (1994). 

26 Buchanan (2004). 
reviewers, and government experts.²⁷ The different participants in the
scientific research system all serve their own interests as they follow the
system’s incentives.

Well-published university researchers earn tenure, promotion, 
lateral moves to more prestigious universities, salary increases, grants, 
professional reputation, and public esteem—above all, from publishing 
exciting, new, positive results. The same incentives affect journal editors, 
who receive acclaim for their journal, and personal reputational awards, 
by publishing exciting new research—even if the research has not been 
vetted thoroughly.²⁸ Grantors want to fund the same sort of exciting
research—and government funders possess the added incentive that
exciting research with positive results also supports the expansion of
their organizational mission.²⁹ American university administrations
want to host grant-winning research, from which they profit by receiving
“overhead” costs—frequently a majority of overall research grant costs.³⁰

All these incentives reward published research with new positive claims—
but not reproducible research. Researchers, editors, grantors, bureaucrats, 
university administrations—each has an incentive to seek out the excit- 
ing new research that draws money, status, and power, but few or no 
incentives to double check their work. Above all, they have little incentive 
to reproduce the research, to check that the exciting claim holds up— 
because if it does not, they will lose money, status, and prestige. 

Each member of the scientific research system, seeking to serve 
his or her own interest, engages in procedures guaranteed to inflate 
the production of exciting, but false research claims in peer-reviewed 
publications. Collectively, the scientific world’s professional incentives 
do not sufficiently reward reproducible research. We can measure the
overall effect of the scientific world’s professional incentives by
analyzing publication bias.


27 Cecil (1985); Feinstein (1988). 

28 Ritchie (2020). 

29 Martino (2017); Lilienfeld (2017). 

30 Cordes (1998); Kaiser (2017); Roche (1994). 
Academic Incentives versus Industrial Incentives


Far too many academics and bureaucrats, and a distressingly large share of the
public, believe that university science is superior to industrial science. University
science is believed to be disinterested; industrial science corrupted by the desire to
make a profit. University science is believed to be accurate and reliable; industrial
science is not.³¹

Our critique of the scientific world’s professional incentives is, above all, a critique
of university science incentives. According to one study, zero out of fifty-two
epidemiological claims could be replicated in randomized trials.³² According to another,
only 36 of 100 of the most important psychology studies could be replicated.³³
Nutritional research, a tissue of disproven claims such as the claim that coffee causes
pancreatic cancer, has lost much of its public credibility.³⁴ Academic science, both
observational and experimental, possesses astonishingly high error rates—and peer and
editorial review of university research no longer provides effective quality control.
Industry research is subject to far more effective quality control.
Government-imposed Good Laboratory Practice Standards, and their equivalents, apply to a broad
range of industry research—and do not apply to university research.³⁶ Industry,
moreover, is subject to the most effective quality control of all—a company’s products
must work, or it will go out of business.³⁷ Both the profit incentive and government
regulation tend to make industrial science reliable; neither operates upon
academic science.


As we will see below, environmental epidemiology studies are largely based on uni- 
versity research. We should treat it with the same skepticism as we would industry 
research. 


Publication Bias: How Published Research 
Skews Toward False Positive Results 


The scientific world’s incentives for exciting research rather than
reproducible research drastically affect which research scientists
submit for publication. Scientists who try to build their careers on
checking old findings or publishing negative results are unlikely to
achieve professional success. The result is that scientists simply do not
submit negative results for publication. Some negative results go to the


31 E.g., Oreskes (2010).

32 Young (2011).

33 Open Science Collaboration (2015).

34 Bidel (2013); Chambers (2017); Harris (2017); Hubbard (2015); MacMahon (1981).
35 Feinstein (1988); Ogden (2011); Schachtman (2011); Schroter (2008); Smith (2010). 
36 E.g., EPA (n.d.). 

37 Taleb (2018). 
file drawer. Others somehow turn into positive results as researchers,
consciously or unconsciously, massage their data and their analyses.
Neither do they perform or publish many replication studies, since the
scientific world’s incentives do not reward those activities either.³⁸

We can illustrate this effect by anecdote. One co-author recently
attended a conference where a young scientist stood up and said she
spent six months trying unsuccessfully to replicate a literature claim.
Her mentor said to move on—and that failed replication never entered
the scientific literature. Individual papers also recount problems, such
as difficulties encountered when correcting errors in peer-reviewed
literature.³⁹ We can quantify this skew by measuring publication bias—
the skew in published research toward positive results compared with
results present in the unpublished literature.⁴⁰

A body of scientific literature ought to have a large number of negative
results, or mixed and inconclusive results. When we examine
a given body of literature and find an overwhelmingly large number
of positive results, especially when we check it against the unpublished
literature and find a larger number of negative results, we have evidence
that the discipline’s professional literature is skewed to magnify positive
effects, or even create them out of whole cloth.⁴¹
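
The file-drawer mechanism described above can be sketched with a small simulation. This is an illustrative toy model in Python, not an analysis from this report: it assumes the studied effect is truly zero, that every statistically significant study is published, and that a non-significant study is published only with a chosen probability. The function names, the 10% publication probability, and the sample sizes are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_literature(n_studies, publish_prob_null, seed=0, alpha=0.05):
    """Simulate studies of an effect that is truly zero.

    Every statistically significant study is published; a non-significant
    study is published only with probability publish_prob_null (the rest
    stay in the file drawer). Both rules are illustrative assumptions.
    Returns (share significant among all studies,
             share significant among published studies).
    """
    rng = random.Random(seed)
    all_significant = 0
    published = 0
    published_significant = 0
    for _ in range(n_studies):
        z = rng.gauss(0.0, 1.0)  # test statistic when the null is true
        significant = two_sided_p(z) < alpha
        all_significant += significant
        if significant or rng.random() < publish_prob_null:
            published += 1
            published_significant += significant
    return all_significant / n_studies, published_significant / published


if __name__ == "__main__":
    share_all, share_published = simulate_literature(20000, 0.10, seed=1)
    print(f"significant among all studies:       {share_all:.3f}")
    print(f"significant among published studies: {share_published:.3f}")
```

Under these assumptions roughly one study in twenty is significant by chance, yet significant results make up about a third of the published record, because most null results never appear.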

As far back as 1987, a study of the medical literature on clinical trials
showed a publication bias toward positive results: “Of the 178 completed
unpublished randomized controlled trials (RCTs)⁴² with a trend specified,
26 (14%) favored the new therapy compared to 423 of 767 (55%) published
reports.”⁴³ Later studies provide further evidence that the phenomenon
affects an extraordinarily wide range of fields, including:


38 Randall (2018); Ritchie (2020). 

39 Allison (2016). 

40 Olson (2002); Randall (2018). 

41 Chambers (2017); Harris (2017); Hubbard (2015); Ritchie (2020). 

42 We use RCTs in the remainder of this report to refer both to “randomized controlled trials” and to
“randomized clinical trials”; both terms are common in the literature, and they are roughly equivalent.

43 Dickersin (1987). 
1. the social sciences generally, where “strong results are 40
percentage points more likely to be published than are null
results and 60 percentage points more likely to be written
up;”⁴⁴

2. climate science, where “a survey of Science and Nature
demonstrates that the likelihood that recent literature is
not biased in a positive or negative direction is less than one
in 5.2 x 10⁻¹⁶;”⁴⁵

3. psychology, where “the negative correlation between effect
size and samples size, and the biased distribution of p values
indicate pervasive publication bias in the entire field of
psychology;”⁴⁶

4. sociology, where “the hypothesis of no publication bias can
be rejected at approximately the 1 in 10 million level;”⁴⁷

5. research on drug education, where “publication bias
was identified in relation to a series of drug education
reviews which have been very influential on subsequent
research, policy and practice;”⁴⁸ and research on
“mindfulness-based mental health interventions,” where “108 (87%)
of 124 published trials reported ≥1 positive outcome in the
abstract, and 109 (88%) concluded that mindfulness-based
therapy was effective, 1.6 times greater than the expected
number of positive trials based on effect size.”⁴⁹


Most relevantly for this report on PM2.5 regulation, publication bias
has contributed heavily to the ratio of false positives to false negatives in
published environmental epidemiology literature; this ratio is probably
at least 20 to 1.⁵⁰


44 Franco (2014). 

45 Michaels (2008). 

46 Kühberger (2014).

47 Gerber (2008). 

48 McCambridge (2007). 

49 Coronado-Montoya (2016). 
50 Ioannidis (2011).
What publication bias especially leads to is a skew in favor of research
that erroneously claims to have discovered a statistically significant
relationship in its data.


What is Statistical Significance? 


The requirement that a research result be statistically significant has
long been a convention of epidemiologic research.⁵¹ In hundreds of journals,
in a wide variety of disciplines, you are much more likely to get
published if you claim to have a statistically significant result. To under- 
stand the nature of the irreproducibility crisis, we must examine the 
nature of statistical significance. Researchers try to determine whether 
the relationships they study differ from what can be explained by chance 
alone by gathering data and applying hypothesis tests, also called tests of 
statistical significance. In practice, the hypothesis that forms the basis of a 
test of statistical significance is rarely the researcher’s original hypothesis
that a relationship between two variables exists. Instead, scientists
almost always test the hypothesis that no relationship exists between the 
relevant variables. Statisticians call this the null hypothesis. As a basis for 
statistical tests, the null hypothesis is mathematically precise in a way 
that the original hypothesis typically is not. A test of statistical signifi- 
cance yields a mathematical estimate of how well the data collected by 
the researcher supports the null hypothesis. This estimate is called a 
p-value. 

It is traditional in environmental epidemiology to use confidence 
intervals instead of p-values from a hypothesis test to demonstrate 
statistical significance. As both confidence intervals and p-values are 
constructed from the same data, they are interchangeable, and one can 
be estimated from the other. Our use of p-values in this report implies
they can be (and are) estimated from the confidence intervals used in
environmental epidemiology studies.
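
The conversion from a confidence interval to a p-value can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not necessarily the exact procedure used in this report: it assumes a ratio measure (such as a risk ratio) whose logarithm is approximately normally distributed, a reported 95% confidence interval, and a null value of 1. The function name and the example numbers are hypothetical.

```python
import math


def p_value_from_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate two-sided p-value for a ratio estimate and its 95% CI.

    Works on the log scale, where the interval is assumed symmetric:
      standard error = (log(upper) - log(lower)) / (2 * z_crit)
      z              = log(estimate) / standard error
    The p-value is the two-sided tail area of the standard normal at |z|.
    """
    standard_error = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # A risk ratio of 1.10 whose interval just touches 1.00 gives p close to 0.05.
    print(round(p_value_from_ratio_ci(1.10, 1.00, 1.21), 3))
```

An interval whose lower bound sits exactly on the null value of 1 corresponds to a p-value at the conventional 0.05 threshold, which is why the two ways of reporting significance carry the same information.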


51 NASEM (1991).
52 Altman (2011a); Altman (2011b). 
The Bell Curve and the P-Value: The Mathematical Background


All “classical” statistical methods rely on the Central Limit Theorem, proved by 
Pierre-Simon Laplace in 1810. 


The theorem states that if a series of random trials are conducted, and if the results 
of the trials are independent and identically distributed, the resulting normalized 
distribution of actual results, when compared to the average, will approach an ideal- 
ized bell-shaped curve as the number of trials increases without limit. 
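
A short simulation can make the theorem concrete. The sketch below is illustrative only: it assumes independent draws from a uniform distribution on (0, 1), standardizes the sample mean, and checks how often the result falls within 1.96 standard errors of the true mean (about 95% for a bell curve). The function names and sample sizes are hypothetical.

```python
import math
import random


def standardized_sample_means(n_trials, n_repeats, seed=0):
    """Return n_repeats standardized means of n_trials Uniform(0, 1) draws."""
    rng = random.Random(seed)
    mean = 0.5               # mean of Uniform(0, 1)
    sd = math.sqrt(1 / 12)   # standard deviation of Uniform(0, 1)
    results = []
    for _ in range(n_repeats):
        sample_mean = sum(rng.random() for _ in range(n_trials)) / n_trials
        results.append((sample_mean - mean) / (sd / math.sqrt(n_trials)))
    return results


def share_within(values, limit=1.96):
    """Share of values whose absolute size is at most limit."""
    return sum(abs(v) <= limit for v in values) / len(values)


if __name__ == "__main__":
    for n in (1, 5, 30):
        share = share_within(standardized_sample_means(n, 5000, seed=1))
        print(f"n_trials={n:>2}: share within 1.96 = {share:.3f}")
```

With a single draw the standardized value can never exceed about 1.73, so every result lies within the limit; as the number of trials grows, the share settles near the 95% expected of the bell curve. The simulation depends on the independence and identical distribution of the draws, which is the assumption the sidebar goes on to question.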


By the early twentieth century, as the industrial landscape came to be dominated by 
methods of mass production, the theorem found application in methods of industri- 
al quality control. Specifically, the p-test naturally arose in connection with the ques- 
tion “how likely is it that a manufactured part will depart so much from specifications 
that it won't fit well enough to be used in the final assemblage of parts?” The p-test, 
and similar statistics, became standard components of industrial quality control. 


It is noteworthy that during the first century or so after the Central Limit Theorem 
had been proved by Laplace, its application was restricted to actual physical mea- 
surements of inanimate objects. While philosophical grounds for questioning the 
assumption of independent and identically distributed errors existed (i.e., we can 
never know for certain that two random variables are identically distributed), the 
assumption seemed plausible enough when discussing measurements of length, or 
temperatures, or barometric pressures. 


Later in the twentieth century, social scientists seeking to make their fields of
inquiry appear more “scientific” began to apply the Central Limit Theorem to human
data, even though nobody can possibly believe that any two human beings—the things
now being measured—are truly independent and identical. The entire statistical basis of
“observational social science” rests on shaky supports, because it assumes the truth of
a theorem that cannot be proved applicable to the observations that social scientists
make.


A p-value estimated from a confidence interval is a number between
zero and one, representing a probability based on the assumption that
the null hypothesis is actually true.⁵³ A very low p-value means that, if
the null hypothesis is true, the researcher’s data are rather extreme—
surprising, because a researcher’s formal thesis when conducting a null
hypothesis test is that there is no association or difference between two
groups. It should be rare for data to be so incompatible with the null
hypothesis. But perhaps the null hypothesis is not true, in which case the


53 Given the assumption that the null hypothesis is actually true, the p-value indicates the frequency with which 
the researcher, if he repeated his experiment by collecting new data, would expect to obtain data less com- 
patible with the null hypothesis than the data he actually found. A p-value of 0.20, for example, means that if 
the researcher repeated his research over and over in a world where the null hypothesis is true, only 20% of his 
results would be less compatible with the null hypothesis than the results he actually got. 
researcher’s data would not be so surprising. If nothing is wrong with
the researcher’s procedures for data collection and analysis, then the 
smaller the p-value, the less likely it is that the null hypothesis is correct. 

In other words: the smaller the p-value, the more reasonable it is to 
reject the null hypothesis and conclude that the relationship originally 
hypothesized by the researcher does exist between the variables in 
question. Conversely, the higher the p-value, and the more typical the 
researcher’s data would be in a world where the null hypothesis is true, 
the less reasonable it is to reject the null hypothesis. Thus, the p-value 
provides a rough measure of the validity of the null hypothesis—and, by
extension, of the researcher’s “real hypothesis” as well.⁵⁴ Or it would, if
a statistically significant p-value had not become the gold standard for
scientific publication.⁵⁵
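
The definition of a p-value as a frequency under the null hypothesis can be illustrated with a hypothetical coin-flip experiment. The Python sketch below is an assumption-laden toy, not part of this report’s analysis: the null hypothesis is a fair coin, the observed result is 60 heads in 100 flips, and the p-value is the share of repeated fair-coin experiments that land at least as far from 50 heads as the observed one. It uses the common “at least as extreme” convention; the function names are hypothetical.

```python
import math
import random


def exact_two_sided_p(heads, flips):
    """Exact probability, under a fair coin, of a head count at least as far
    from flips / 2 as the observed count."""
    observed_gap = abs(heads - flips / 2)
    extreme = sum(
        math.comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_gap
    )
    return extreme / 2 ** flips


def simulated_two_sided_p(heads, flips, n_repeats=20000, seed=0):
    """Estimate the same probability by repeating the experiment in a world
    where the null hypothesis (a fair coin) is true."""
    rng = random.Random(seed)
    observed_gap = abs(heads - flips / 2)
    hits = 0
    for _ in range(n_repeats):
        # Each bit of a random integer stands for one fair coin flip.
        simulated_heads = bin(rng.getrandbits(flips)).count("1")
        if abs(simulated_heads - flips / 2) >= observed_gap:
            hits += 1
    return hits / n_repeats


if __name__ == "__main__":
    print(f"exact:     {exact_two_sided_p(60, 100):.4f}")
    print(f"simulated: {simulated_two_sided_p(60, 100, seed=1):.4f}")
```

Both numbers describe how surprising the data would be if the null hypothesis were true; neither is the probability that the null hypothesis itself is true.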


Why Does Statistical Significance Matter? 


The government’s central role in science, both in funding scientific 
research and in using scientific research to justify regulation, further 
disseminated statistical significance throughout the academic world. 
Within a generation, statistical significance went from a useful short- 
hand that agricultural and industrial researchers used to judge whether 
to continue their current line of work, or switch to something new, to a
prerequisite for regulation, government grants, tenure, and every other 
form of scientific prestige—and also, and crucially, the essential prereq- 
uisite for professional publication. 

Scientists’ incentive to produce positive, original results became 
an incentive to produce statistically significant results. Groupthink, 
frequently enforced via peer review and editorial selection, inhib- 
its publication of results that run counter to disciplinary or 
54 NASEM (2019); Randall (2018). 

55 Briggs, Trafimow, and others reject the use of p-values for analyzing and interpreting data. Briggs (2016); 
Briggs (2019); Trafimow (2018); and see Berger (1987); Cohen (1994). They argue that null hypothesis sig- 
nificance testing, p-values and the like are irredeemably flawed and that they should never be used in any 
way. We do not dispute this argument—but neither do we use it in this particular critique. As risk ratios and 
confidence intervals are common statistical measures in environmental epidemiology, our use of p-values 
is in any case as a complementary measure of confidence intervals for p-value plotting. McCormack (2013);
Montgomery (2003). We do generally recommend that environmental epidemiologists address the critique
by Briggs, et al.
political presuppositions.⁵⁶ Many more scientists use a variety of
statistical practices, with more or less culpable carelessness, including:


improper statistical methodology; 

consciously or unconsciously biased data manipulation that 
produces desired outcomes; 

choosing between multiple measures of a variable, select- 
ing those that provide statistically significant results, and 
ignoring those that do not; and 


using illegitimate manipulations of research techniques.⁵⁷


Still others run statistical analyses until they find a statistically
significant result—and publish the one (likely spurious) result. Far too
many researchers report their methods unclearly, and let the uninformed
reader assume they actually followed a rigorous scientific
procedure.⁵⁸ A remarkably large number of researchers admit informally to
one or more of these practices—which collectively are informally called
p-hacking.⁵⁹ Significant evidence suggests that p-hacking is pervasive in
an extraordinary number of scientific disciplines.⁶⁰ HARKing is the most
insidious form of p-hacking.
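
One of the practices listed above, choosing among multiple measures and reporting only the one that reaches significance, can be sketched in a simulation. The Python code below is a hypothetical toy model, not drawn from this report’s data: it assumes two groups that truly do not differ, independent outcome measures with unit variance, and a researcher who reports a “finding” whenever the smallest p-value falls below 0.05. All names and sample sizes are assumptions.

```python
import math
import random


def z_test_p(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming unit variance."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(z) / math.sqrt(2))


def share_of_studies_with_a_finding(n_outcomes, n_studies=1000,
                                    n_per_group=20, alpha=0.05, seed=0):
    """Share of null studies that can report at least one significant outcome
    when n_outcomes measures are tested and only the best one is reported."""
    rng = random.Random(seed)
    studies_with_finding = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: no real effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p(a, b))
        if min(p_values) < alpha:
            studies_with_finding += 1
    return studies_with_finding / n_studies


if __name__ == "__main__":
    for k in (1, 5, 10):
        share = share_of_studies_with_a_finding(k, seed=1)
        print(f"{k:>2} outcomes tested: {share:.2f} of studies report a finding")
```

With one pre-specified outcome, about 5% of these null studies produce a significant result; with ten outcomes to choose from, the expected share is 1 - 0.95^10, or roughly 40%, even though no real effect exists.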


HARKing: Exploratory Research Disguised 
as Confirmatory Research 


To HARK is to hypothesize after the results are known—to look at the data
first and then come up with a hypothesis that provides a statistically
significant result.⁶¹ Irreproducible research hypotheses produced by


56 Ritchie (2020); and see Joseph (2020). 

57 Randall (2018). 

58 Chambers (2017); Harris (2017); Hubbard (2015); Randall (2018); Ritchie (2020). 

59 Fanelli (2009); John (2012); Randall (2018); Ritchie (2020); Schwarzkopf (2014); Simonsohn (2014). 
60 Bruns (2016); Head (2015); but see Hartgerink (2017); Tanner (2015). 

61 Randall (2018); Ritchie (2020). 
HARKing send whole disciplines chasing down rabbit holes, as scientists
interpret their follow-up research to conform to a highly tentative piece
of exploratory research that was pretending to be confirmatory research.
Scientific advance depends upon scientists maintaining a distinction
between exploratory research and confirmatory research, precisely to
avoid this mental trap. These two types of research should utilize entirely
different procedures. HARKing conflates the two by pretending that a
piece of exploratory research has really followed the procedures of
confirmatory research.⁶²

Jaeger and Halliday provide a useful brief definition of exploratory
and confirmatory research, and how they differ from one another:


Explicit hypotheses tested with confirmatory research usually
do not spring from an intellectual void but instead are often
gained through exploratory research. Thus exploratory approaches
to research can be used to generate hypotheses that
later can be tested with confirmatory approaches. ... The end
goal of exploratory research ... is to gain new insights, from
which new hypotheses might be developed. ... Confirmatory
research proceeds from a series of alternative, a priori hypotheses
concerning some topic of interest, followed by the development
of a research design (often experimental) to test those
hypotheses, the gathering of data, analyses of the data, and
ending with the researcher’s inductive inferences. Because
most research programs must rely on inductive (rather than
deductive) logic..., none of the alternative hypotheses can be
proven to be true; the hypotheses can only be refuted or not
refuted. Failing to refute one or more of the alternative hypotheses
leads the researcher, then, to gain some measure of
confidence in the validity of those hypotheses.⁶³


62 Ritchie (2020).
63 Jaeger (1998).
Exploratory research, in other words, has few predefined hypotheses.
Researchers do not at first know what precisely they’re looking for, or
even necessarily where to look for it. They “typically generate hypotheses
post hoc rather than test a predefined hypothesis.”⁶⁴ Exploratory
studies can easily raise thousands of separate scientific claims⁶⁵ and they
possess an increased risk of finding false positive associations.

Confirmatory research “tests predefined hypotheses usually derived
from a theory or the results of previous studies that can be used to draw
firm and often meaningful conclusions.”⁶⁶ Confirmatory studies ideally
should focus on just one hypothesis, to provide a severe test of its validity.
In good confirmatory research, researchers control every significant
variable.

When multiple questions are at issue, researchers should use procedures
such as Multiple Testing and Multiple Modeling (MTMM) to control
for experiment-wise error—the probability that at least one individual
claim will register a false positive when you conduct multiple statistical
tests.⁶⁷ (For further information about MTMM, see Appendix 1: Multiple
Testing and Multiple Modeling (MTMM) and Epidemiology.)
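
Experiment-wise error and its control can be sketched in a few lines of Python. The example below is a hedged illustration: it uses the Bonferroni and Holm corrections, two standard multiple-testing adjustments chosen here for simplicity, and does not reproduce the specific MTMM procedure described in Appendix 1. The error-rate formula assumes independent tests, and the function names and p-values are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n_tests independent
    tests of true null hypotheses, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Reject a null hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: compare the smallest p-value with alpha / m,
    the next with alpha / (m - 1), and so on; stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


if __name__ == "__main__":
    print(round(familywise_error_rate(0.05, 20), 3))
    p_values = [0.001, 0.015, 0.04, 0.30]
    print(bonferroni_reject(p_values))
    print(holm_reject(p_values))
```

Twenty uncorrected tests at the 0.05 level carry a chance of roughly 64% of producing at least one false positive. Both corrections hold that chance at or below 0.05; Holm is never less powerful than Bonferroni, which is why it retains the second claim in the hypothetical example.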

Researchers should state the hypothesis clearly, draft the research 
protocol carefully, and leave as little room for error as possible in execu- 
tion or interpretation. Properly conducted, confirmatory research is 
by its nature far less likely to find false positive associations than origi- 
nal research, and conclusions supported by confirmatory research are 
correspondingly more reliable. 

Researchers resort to HARKing—exploratory research that mimics 
confirmatory research—not only because it can increase their publica- 
tion rate but also because it can increase their prestige. HARKing scien- 
tists can gain the reputation for an overwhelmingly probable research 
result when all they have really done is set the stage for more follow-on
false positive results or file-drawer negative results.
Another way to define HARKing is that, like p-hacking more generally,
it overfits data—it produces a model that conforms to random data.
Consider, for example, The Life Project, a generations-long British cohort
study about human development that provided data for innumerable
professional articles in a range of social science and health disciplines,
including 2,500 papers drawing solely on data about the cohort born
in 1958.⁶⁹ These 2,500 articles have influenced a wide variety of public
policy initiatives, by asserting that X cause is associated with Y effect—
for example, that babies born on weekdays thrive better than babies born
on weekends.⁷⁰

But The Life Project never stated any research hypothesis in advance— 
it simply asked large numbers of questions and searched for possible 
associations. This is bound to produce false positives—statistical associations
produced by pure chance. It is the essence of HARKing: exploratory
research masquerading as confirmatory research. The sheer number 
of associations examined by The Life Project indicates that any claim 
of an association between a cause and an effect—e.g., weekday babies 
thrive, weekend babies don’t—should be considered to have no statistical 
support unless the p-value for an association has been evaluated using 
Multiple Testing and Multiple Modeling procedures. 

Food frequency questionnaires (FFQ) generally suffer the same frailties
that afflict cohort studies such as The Life Project. In a typical FFQ,
researchers ask people to recall whether they have consumed various
specified foods, and in what quantities. Researchers then ask whether
they have experienced various specified health events at a much later
date. These FFQs suffer from the frailties of human memory—but they
also simply ask large numbers of questions and search for possible
associations. For example, the 1985 Willett FFQ, notable in nutrition science
literature, asked people questions about 61 different foods.⁷¹

Two more recent FFQs, typical of the field, respectively ask people
about 264 food items and 900 food items.” As with The Life Project, the
sheer number of possible associations, none of which confirmed a prior
research hypothesis, is the definition of exploratory research that
should be considered to have no statistical support unless the p-value
for an association has been evaluated using MTMM procedures.

68 Ritchie (2020).
69 Pearson (2016).
70 McKie (2016).

HARKing, unfortunately, includes yet wider categories of research.
When scientists preregister their research, they specify and publish 
their research plan in advance. All un-preregistered research can be 
susceptible to HARKing, as it allows researchers to transform their 
exploratory research into confirmatory research by looking at their 
data first and then constructing a hypothesis to fit the data, without 
informing peer reviewers that this is what they did.” In general, researchers 
too frequently fail to make clear distinctions between exploratory and 
confirmatory research, or to signal transparently to their readers the
nature of their own research.”


Consequences: Canonization of False Claims


Publication bias, p-hacking, and HARKing collectively have seriously 
degraded scientific research as a whole. Head, et al. noted that p-hacking 
pervades virtually every scientific discipline.” Disciplines such as physics,
astronomy, and genome-wide association studies (GWAS) appear
to be exceptions to this generalization, but that is because they define 
significant p-values as several orders of magnitude smaller than 0.05. 
Elsewhere, the effects are stark. 

As early as 1975, Greenwald noted that only 6 percent of researchers 
were inclined to publish a negative result, whereas 60 percent were 
inclined to publish a positive result—a ratio of ~10 to 1.” Simonsohn, et 
al., note that replication does not necessarily support a claim if a field of
research has been subject to data manipulation, or has failed to report
negative results.78 Many researchers consider the essentially improper
procedure of testing many questions using the same data set to be
“business as usual,” even though research that does not control for the size
of the analysis search space cannot be considered to have any statistical
support.79

A false research claim can be canonized as the foundation for an
entire body of literature that is uniformly false. Nissen undertook a
theoretical analysis and noted that it is possible for a false claim to become
an established “truth,” and that it is especially likely in disciplines
where publication bias skews heavily in favor of positive studies and
against negative studies.80 We may recollect here the 2015 Open Science
Collaboration study that failed to replicate 64 of 100 examined
“canonical” psychology studies.81 Nissen’s claim, based upon a mathematical
model and simulations, appears to have received experimental
substantiation in the discipline of psychology. One co-author’s examination of a
body of environmental epidemiology literature also supports his thesis.82
It is all too likely to be true throughout the sciences and social sciences.


What Can Be Done?


Modern scientific research’s irreproducibility crisis arises above 
all from extraordinary amounts of publication bias, p-hacking, and 
HARKing. We cannot tell exactly which pieces of research have been 
affected by these errors until scientists replicate every single piece of 
published research. Yet we do possess sophisticated statistical strategies 
that will allow us to diagnose specific claims, fields, and literatures that 
inform government regulation, so that we may provide a severe, repli- 
cable test that allows us to quantify the combined effect of p-hacking, 
HARKing, and publication bias on a claim or a field as a whole. 

One such method—an acid test for statistical skullduggery—is p-value 
plotting. 

78 Simonsohn (2014).
79 Chambers (2017); Clyde (2000); Harris (2017); Hubbard (2015); Mayo (2018); Rothman (1990); Westfall (1993); Young (2018a).
80 Nissen (2016); and see Akerlof (2018); Grimes (2018); Smaldino (2016).
81 Open Science Collaboration (2015).
82 Young (2019c).

P-Value Plotting: A Severe Test for Publication Bias,

Introduction

We use p-value plotting to test whether a field has been affected
by the irreproducibility crisis, that is, by publication bias, p-hacking,
and HARKing. In essence, we analyze meta-analyses of research and
output their results on a simple plot that displays the distribution
of p-value results:


* A literature unaffected by publication bias, p-hacking or
  HARKing should display its results as a single line.

* A literature which has been affected by publication bias,
  p-hacking or HARKing should display bilinearity: its results
  divide into two separated lines.

P-value plotting of meta-analysis results allows a reader, at a glance, to
determine whether there is circumstantial evidence that a body of
scientific literature has been affected by the irreproducibility crisis.

We will summarize here the statistical components of p-value plot- 
ting. We will begin by outlining a few basic elements of statistical meth- 
odology: counting; the definition and nature of p-values; and a simple 
p-value plotting method, which makes it relatively simple to evaluate 
a collection of p-values. We will then explain what meta-analyses are, 
and how they are used to inform government regulation. We will then 
explain how precisely p-value plotting of meta-analyses works, and what
it reveals about the scientific literature it tests.


Counting 


Counting can be used to identify which research papers in a literature
may suffer from the various biases described above. We should want to
know how many “questions” are under consideration in a research paper.
In a typical environmental epidemiology paper, for example, there are
usually several health outcomes at issue, such as deaths from all causes,
heart attacks, other cardiovascular deaths, and pulmonary deaths.
Researchers consider whether a risk factor, such as the concentration of
particular air components, predicts any of these health outcomes; that
is to say, whether the concentration of an air component may be
“positively” associated with a particular health outcome.83

When they study air pollution, environmental epidemiologists may
analyze six air components: carbon monoxide (CO), nitrogen dioxide
(NO2), ozone (O3), sulfur dioxide (SO2), and two sizes of particulate
matter: particulate matter 10 micrometers or less in diameter (PM10)
and particulate matter 2.5 micrometers or less in diameter (PM2.5).
Each component is a predictor; each type of health effect is an outcome.
Scientists may further analyze an association between a particular air
component and a particular health outcome with reference to categories
of analysis such as weather, age, and sex. Researchers call these further
categories of analysis covariates; covariates may affect the strength of the
association, but they are not the direct objects of study.

An epidemiology paper considers a number of questions equal to the 
product of the number of outcomes (O) times the number of predictors 
(P) times 2 to the power of the number of covariates (C). In other words: 

$$\text{number of questions} = O \times P \times 2^{C}$$

This formula approximates the number of statistical tests an
epidemiology study performs. The larger the number of statistical tests, the
easier it is to find a statistically significant association due solely to
chance.
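
The counting rule can be sketched in a few lines of Python. The function name `number_of_questions` and the example study (four health outcomes, six air components, three covariates, loosely following the examples named above) are illustrative assumptions, not figures taken from any particular paper.

```python
def number_of_questions(outcomes, predictors, covariates):
    """Approximate count of statistical tests: O x P x 2^C."""
    return outcomes * predictors * 2 ** covariates


# Hypothetical study: 4 health outcomes, 6 air components, 3 covariates.
n = number_of_questions(outcomes=4, predictors=6, covariates=3)

print(n)         # 192
print(n * 0.05)  # tests expected to reach p < 0.05 by chance alone
```

Each additional covariate doubles the count, because every model can either include that covariate or leave it out.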


P-values 


As we have summarized above, a null hypothesis significance test is
a method of statistical inference in which a researcher tests a factor (or
predictor) against a hypothesis of no association with an outcome. The
researcher uses an appropriate statistical test to attempt to disprove the
null hypothesis. The researcher then converts the result to a p-value.
The p-value is a value between 0 and 1 and it is a numerical measure of
significance. The smaller the p-value, the more significant the result.
Significance is the technical term for surprise. When we are conducting
a null hypothesis significance test, we should expect no relationship
between any particular predictor and any particular outcome. Any
association, any departure from the null hypothesis (random chance), should
and does surprise us.

83 “Air component” is more precise than “air pollutant,” which prejudges the scientific issues at question.

If the p-value is small—conventionally in many disciplines, less than 
0.05—then the researcher may reject the null hypothesis and conclude 
the result is surprising and that there is indeed evidence for a significant 
relationship between a predictor and an outcome. If the p-value is large
(conventionally, greater than 0.05), then the researcher should not reject the
null hypothesis and should conclude there is nothing surprising and that there
is no evidence for a significant relationship between a predictor and an 
outcome. 

But strong evidence is not dispositive (absolute) evidence. By definition,
where the threshold is p = 0.05, a null hypothesis that is true will be rejected, by chance,
5% of the time. When this happens, it is called a false positive—false 
positive evidence for the research hypothesis (false evidence against 
the null hypothesis). The size of the experiment does not matter. When 
researchers compute a single p-value, both large and small studies have 
a 5% chance of producing a false positive result. 

Such studies, by definition, can also produce false negatives—false 
negative evidence against the research hypothesis (false evidence for 
the null hypothesis). In a world of pure science, false positives and false 
negatives would have equally negative effects on published research. But 
all the incentives in our summary of the Irreproducibility Crisis indicate 
that scientists vastly overproduce false positive results. We will focus
here, therefore, on false positives, which far outnumber false negatives
in the published scientific literature.84

84 Ioannidis (2011).

We will focus particularly on how and why conducting a large number
of statistical tests produces many false positives by chance alone.


Simulating Random p-values 


We can illustrate how a large number of statistical tests produces false
positives by chance alone by means of a simulated experiment. We can
use a computer to generate 100 pseudo-random numbers between 0 and
1 that mimic p-values and enter them into a 5 x 20 table. (See Figure 1.)
These randomly generated p-values should be evenly distributed, with
approximately 5 results between 0 and 0.05, 5 between 0.05 and 0.10,
and so on; approximately, because a randomly generated sequence of
numbers should not produce a perfectly uniform distribution.
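
A minimal sketch of such a simulation follows. It is not the code used to produce Figure 1; the seed, the variable names, and the use of Python's standard `random` module are assumptions chosen so that the example is repeatable.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable (an assumption)

ROWS, COLS = 20, 5  # 20 rows by 5 columns, matching the layout of Figure 1

# Draw 100 pseudo-random numbers on [0, 1) to stand in for p-values.
table = [[random.random() for _ in range(COLS)] for _ in range(ROWS)]

flat = [p for row in table for p in row]
significant = [p for p in flat if p < 0.05]

print(len(flat))         # 100
print(len(significant))  # varies with the seed; about 5 on average
```

Changing the seed changes which cells fall below 0.05, but over many runs roughly 5 of the 100 values do so.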

In Figure 1, we have simulated an environmental epidemiology study 
analyzing associations between air components and tumors. Remember, 
these numbers were picked at random. 

Each box in Figure 1 represents a different statistical test applied
to associate a predictor (an air component) with an outcome (a health
consequence). The Figure displays results of null hypothesis tests
analyzing whether the annual incidence of 20 different tumors observed during
a given year for 5 different air components is greater than an expected annual
incidence rate of each tumor. Each box represents one out of 100 null
hypothesis statistical tests: 1 test for each of 20 tumors, applied to 5 different
air components. The number in the box represents the p-value of each
individual statistical test.

This simulation contains four p-values that are less than 0.05:
0.004, 0.038, 0.038 and 0.018. In other words, by sheer chance alone,
a researcher could write and publish four professional articles based
on the four “significant” results (p-values less than 0.05). Researchers
are supposed to take account of these pitfalls (chance outcomes). There
are standard procedures that can be used to prevent researchers from
simply cherry-picking “significant” results.85 But it is all too easy for a
researcher to set aside those standard procedures, to p-hack, and just
report on and write a paper for each result with a nominally significant
p-value.

85 Westfall (1993).

Figure 1: 100 Simulated p-values

Tumor 1 2 3 4 5

T01 0.899 0.417 0.673 0.754 0.686
T02 0.299 0.349 0.944 0.405 0.878
T03 0.868 0.535 0.448 0.430 0.221
T04 0.439 0.897 0.930 0.500 0.257
T05 0.429 0.082 0.478 0.053
T06 0.432 0.305 0.056 0.403 0.821
T07 0.982 0.707 0.460 0.789 0.956
T08 0.723 0.931 0.827 0.296 0.758
T09 0.174 0.982 0.277 0.970 0.366
T10 0.117 0.339 0.281 0.746 0.419
T11 0.433 0.640 0.313 0.310 0.482
T12 0.412 0.428 0.195 0.184
T13 0.663 0.552 0.893 0.084 0.827
T14 0.785 0.398 0.895 0.393 0.092
T15 0.595 0.322 0.159 0.407 0.663
T16 0.553 0.173 0.452 0.859 0.899
T17 0.748 0.480 0.486 0.130
T18 0.643 0.371 0.303 0.614 0.149
T19 0.878 0.548 0.864 0.152
T20 0.559 0.343 0.187 0.109 0.930


P-hacking by Asking Multiple Questions 


As noted above, a standard form of p-hacking is for a researcher to run
statistical analyses until a statistically significant result appears, and
then publish the one (likely spurious) result. When researchers ask hundreds
of questions, when they are free to use any number of statistical models to
analyze associations, it is all too easy to engage in this form of p-hacking.




In general, research based on multiple analyses of large complex data 
sets is especially susceptible to p-hacking, since a researcher can easily 
produce a p-value < 0.05 by chance alone.86 Research that relies on
combining large numbers of questions and computing multiple models
is known as Multiple Testing and Multiple Modeling.87 (See Appendix 1:
Multiple Testing and Multiple Modeling (MTMM) and Epidemiology.) 

Confirmation bias compounds the difficulties of observing a chance 
p-value < 0.05. Confirmation bias, frequently triggered by HARKing 
that falsely conflates exploratory research with confirmatory research, 
influences researchers so that they are more likely to publish research 
that confirms a dominant scientific paradigm, such as the association of 
an air component with a health outcome, and less likely to publish results
that contradict a dominant scientific paradigm. 

The following example, drawn from our earlier research into the rela- 
tionship of air components to health effects, illustrates how we should 
incorporate the role of analysis search space (counting) into this discus- 
sion. In Figure 2 we examine the estimated size of analysis search space 
for eight papers that appeared in a major environmental epidemiology 
journal.88 Figure 2 gives the number of questions, models and search
spaces for these papers listed by first author. 

In Figure 2: 


Questions = Outcomes x Predictors;

Models = $2^{\text{Covariates}}$, as a model can include a covariate, but
need not; and

Search Space = Questions x Models.


Any researcher whose study contains a large search space could
undertake, but not report, a wide range of statistical tests. The
researcher also could use, but not report, different statistical models,
before selecting, using, and reporting the results. Figure 2 demonstrates
just how large a search space is available for researchers to find and
report results with a p-value less than 0.05.

86 Chambers (2017); Glaeser (2006); Harris (2017); Hubbard (2015); Ritchie (2020); Westfall (1993).
87 Westfall (1993).
88 Young (2017b). Researchers used the 8 papers listed in Figure 2 in two meta-analyses that examined studies asking the specific question whether air components are associated with heart attacks.


Figure 2: Estimated Size of Analysis Search Space, Eight Environmental
Epidemiology Papers

RowID  Author     Year  Questions  Models  Search Space
1      Zanobetti  2005          3     128           384
2      Zanobetti  2009        150      16         2,400
3      Ye         2001        560       8         4,480
4      Koken      2003        150      32         4,800
5      Barnett    2006         56     256        14,336
6      Linn       2000        120     128        15,360
7      Mann       2003         96     512        49,152
8      Rich       2010        175   1,024       179,200


These papers are typical of environmental epidemiology studies. As
will be shown later, the median search space across 70 environmental
epidemiology papers that we have recently examined is more than
13,000. A typical environmental epidemiology study is expected to have
by chance alone approximately 13,000 x 0.05 = 650 “statistically
significant” results.
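
The arithmetic behind Figure 2 and the 650 figure can be checked with a short script. The question and model counts are copied from Figure 2; the variable names are illustrative, and the chance calculation assumes that every test in the search space is actually run.

```python
# (author, year, questions, models) as listed in Figure 2
papers = [
    ("Zanobetti", 2005, 3, 128),
    ("Zanobetti", 2009, 150, 16),
    ("Ye", 2001, 560, 8),
    ("Koken", 2003, 150, 32),
    ("Barnett", 2006, 56, 256),
    ("Linn", 2000, 120, 128),
    ("Mann", 2003, 96, 512),
    ("Rich", 2010, 175, 1024),
]

# Search Space = Questions x Models
search_spaces = [questions * models for _, _, questions, models in papers]

for (author, year, _, _), space in zip(papers, search_spaces):
    print(author, year, space)

# Chance "significant" results for a search space of 13,000 at p < 0.05.
median_search_space = 13_000
expected_by_chance = median_search_space * 5 // 100
print(expected_by_chance)  # 650
```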


P-value Plots 


Now we put together several concepts that we have introduced. When 
we conduct a null hypothesis statistical test, we can produce a single 
p-value that can fall anywhere in the interval from 0 to 1, and which is 
considered “statistically significant” in many disciplines when it is less 
than 0.05. We also know that researchers often look at many questions 
and compute many models using the same observational data set, and 
that this allows them to claim that a small p-value produced by chance
substantiates a claim to a significant association.

Consider the following example.89 Researchers made a claim that by
eating breakfast cereal a woman is more likely to have a boy baby.90 The
researchers conducted a food frequency questionnaire (FFQ) that asked 
pregnant women about their consumption of 131 foods at two different 
time points, one before conception and one just after the estimated date 
of conception. The FFQ posed a total of 262 questions. The researchers 
obtained a result with a p-value less than 0.05 and claimed they had 
discovered an association between maternal breakfast cereal consump- 
tion and fetal sex ratios. Their procedure made it highly likely that they 
had simply discovered a false positive association. 
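
A short calculation shows why. The sketch below treats the 262 tests as independent, which is an assumption made for illustration (answers about different foods are likely correlated); the variable names are likewise illustrative.

```python
n_questions = 131 * 2  # 131 foods at two time points
alpha = 0.05           # conventional significance threshold

# Number of p-values below 0.05 expected if no food has any effect.
expected_false_positives = n_questions * alpha

# Chance of at least one p < 0.05 if no food has any effect,
# assuming the 262 tests are independent.
p_at_least_one = 1 - (1 - alpha) ** n_questions

print(n_questions)                         # 262
print(round(expected_false_positives, 1))  # 13.1
print(p_at_least_one > 0.999)              # True
```

Under these assumptions, about 13 nominally significant results would be expected by chance, and finding none at all would be the surprising outcome.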

We cannot prove that any one such result is a false positive, absent a 
series of replication experiments. But we can detect when a given result 
is likely to be a false positive, drawn from a larger body of questions that
indicate randomness rather than a true positive association. 

The way to assess a given result is to collect the larger body of results
that includes the individual result, and then plot the reported p-values
of each of those results. We then use this p-value plot
to examine how uniformly the p-values are spread over the interval 0 to
1. We use the following steps to create the p-value plot.


Rank-order the p-values from smallest to largest. 


Plot the p-values against the integers: 1, 2, 3, ... 
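
The two steps above can be sketched as follows. The function name `p_value_plot_points` and the five example p-values are hypothetical; the sketch only prepares the points and leaves the drawing to whatever plotting tool is at hand.

```python
def p_value_plot_points(p_values):
    """Return (rank, p-value) pairs with p-values sorted smallest to largest."""
    ordered = sorted(p_values)
    return [(rank, p) for rank, p in enumerate(ordered, start=1)]


# Hypothetical reported p-values from a body of related results.
reported = [0.62, 0.04, 0.91, 0.33, 0.18]
points = p_value_plot_points(reported)

print(points)
# [(1, 0.04), (2, 0.18), (3, 0.33), (4, 0.62), (5, 0.91)]
```

The rank goes on one axis and the p-value on the other; the shape traced by the points is what gets interpreted.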


When we have created the p-value plot, we interpret it like this: 


* A p-value plot that forms approximately a 45-degree line
  (i.e., slope = 1) provides evidence of randomness: a literature
  that supports the null hypothesis of no significant association.

* A p-value plot that forms approximately a line with slope
  < 1, where most of the p-values are small (less than 0.05),
  provides evidence for a real effect: a literature that
  supports a statistically significant association.

* A p-value plot that exhibits bilinearity (that divides into two
  lines) provides evidence of publication bias, p-hacking,
  and/or HARKing.91

89 Young (2009).
90 Mathews (2008).


Bilinear behavior (bilinearity) in a rank-ordered p-value plot suggests
two linear relationships with a clear breakpoint to represent the data of
interest. These relationships include a mixture of:

* a set of p-values, all less than 0.05, displaying a
  linear slope much less than 45°, and

* a set of p-values, which may range from near 0 to 1,
  displaying a slope approximately 45°.92


Why does a plotted 45-degree line of p-value results provide evidence
of randomness? When a researcher conducts a series of statistical tests
to test a hypothesis, and there is no significant association, the
individual results ought to appear anywhere in the interval 0 to 1. When we
rank these p-values and plot them against the integers 1, 2, ..., they will
produce a 45-degree line that depicts a uniform distribution of results. The
ranked results, in other words, are spaced at roughly regular intervals,
and collectively produce a uniform distribution
of results.
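
The link between uniformity and a straight line can be made explicit with a standard result about order statistics, offered here as background rather than as part of the argument above. The symbols n, k, and p_(k) are notation introduced for this note. If n p-values are independent draws from the uniform distribution on the interval 0 to 1, the expected value of the k-th smallest is

```latex
E\left[p_{(k)}\right] = \frac{k}{n+1}, \qquad k = 1, 2, \dots, n.
```

Plotted against the rank k, these expected values fall on a straight line rising in equal steps from near 0 to near 1. When the rank axis is drawn to the same length as the p-value axis, that line sits at about 45 degrees.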

Whenever we plot a body of linked p-value results, and the results plot
to a 45-degree line, that is evidence that an individual result is the result of
a random distribution of results; that even a putatively significant
association is really only a fluke result, a false positive, where the evidence as
a whole supports the null hypothesis of no significant association.

We may take this as evidence of randomness whether we apply it to:

* a series of individual studies focused on one question,

* a series of tests that emerge by uncontrolled testing of a set
  of different predictors and different outcomes,

* or a series of meta-analyses.93

91 Young (2019a); Young (2019b); Young (2019c).

92 A bilinear p-value plot cannot be reduced to a mathematical function. The question is one of logic. If some
researchers manipulate data and get small p-values and others do not and get p-values that fit a 45-degree
line, it does not make logical sense to presume there is some smooth mathematical/functional form that fits
both components of the mixture. A bilinear p-value plot, rather, is a strong suggestion of a two-component
mixture of results: p-hacked and random, or true effect and flawed studies.


The null hypothesis assumption is that there is no significant
association. This presumption of a random outcome, of no significant
association, must be positively defeated in a hypothesis test in order to make
a claim of a significant, surprising result.94 The corollary is that an
individual result of a significant association can only be taken as reliable if
any body of results to which it belongs also positively defeats the p-value
plot of a 45-degree line that depicts a uniform distribution of results.95
(For further details for constructing a p-value plot, see Appendix 2:
Constructing P-value Plots.)

Let us return to the research linking breakfast cereal with increased 
conception of baby boys. That statistical association was drawn from 262 
total questions, each of which produced its own p-value. When we plot 
the reported p-values of all 262 of those questions, in Figure 3 below, the 
result is a line of slope 1 (approximately). 


Figure 3: P-value Plot, 262 P-values, Drawn from Food Frequency
Questionnaire, Questions Concerning Boy Baby Conception96

93 Schweder and Spjøtvoll applied p-value plotting to evaluate many different questions. Schweder (1982). We
apply p-value plotting to evaluate meta-analyses devoted to a single question; we believe our application of
p-value plotting is original.

94 Fisher (1925); Fisher (1935); Mayo (2018).

95 An individual p-value that is extraordinarily small (i.e., far below 0.05), after adjustment for multiple testing,
also has potential evidentiary value, but this occurs rarely in well-designed and executed environmental
epidemiology studies that control properly for bias and MTMM.

96 Young (2009). We acquired the data from the original researchers, who to our knowledge have not yet made
it public. Interested scholars who wish to reproduce our analysis should contact the original researchers.

This line supports the presumption of randomness, as a 45-degree
line starting at the origin (0,0) would fit the data very well. The small
p-value, less than 0.05, registered for the association between breakfast 
cereal consumption and boy-baby conception, represents a false positive 
finding. 

P-value plotting likewise reveals randomness, no significant
association, when applied in Figure 4 to a meta-analysis that combined data
from 69 questions drawn from 40 observational studies. The claim being
evaluated in the meta-analysis was whether long-term exercise training of
the elderly is positively associated with greater mortality and morbidity
(increased accidents and falls and hospitalization due to accidents and falls).


Figure 4: P-value Plot, 69 Questions Drawn From 40 Observational
Studies, Meta-analysis of Observational Data Sets Analyzing Association
Between Elderly Long-term Exercise Training and Mortality and
Morbidity Risk97

(Plot axes: p-value against rank order.)


Figure 4, like Figure 3, plots the p-values as a sloped line from left to
right at approximately 45 degrees, and therefore supports the
presumption of randomness. Note that Figure 4 contains four p-values less than
0.05, as well as several p-values close to 1.000. The p-values below
p = 0.05 are most likely false positives.
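
A rough check on the count of small p-values can be sketched with a binomial calculation. It assumes the 69 tests are independent and that every null hypothesis is true, which is a simplification made for illustration; the helper `prob_at_least` is a hypothetical name.

```python
from math import comb

n_tests = 69  # questions in the meta-analysis
alpha = 0.05  # conventional significance threshold

# Number of p-values below 0.05 expected by chance alone.
expected = n_tests * alpha


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


# Chance of seeing four or more p-values below 0.05 when every null is true.
p_four_or_more = prob_at_least(4, n_tests, alpha)

print(round(expected, 2))
print(round(p_four_or_more, 2))
```

With between three and four small p-values expected by chance, observing four is unremarkable under these assumptions.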


97 De Souto Barreto (2019).

These claims are purely statistical. Researchers can, and will, argue
that discipline-specific information justifies treating their particular
claim for a statistical association as genuine: that “relevant biological
knowledge,” for example, supports the claim that there truly is an association
between breakfast cereal consumption and boy-baby conception.98

We recognize the possibility that statisticians and disciplinary
specialists talk past one another and refuse to engage with the substance
of one another’s arguments. But we urge disciplinary specialists, and the
public at large, to consider how extraordinarily unlikely it is for a p-value
plot indicating randomness to itself be a false positive. The
counter-argument that a particular result truly registers a significant association
needs to explain how such a 45-degree line could appear if the
individual results were not the consequence of selecting false positives
for publication.

Such a counter-argument should also consider that p-value plotting 
does register true effects. We applied the same method to produce a
p-value plot in Figure 5 of studies that examined a smoking-lung cancer
association.

Figure 5: P-value Plot, 102 Studies, Association of
Smoking and Squamous Cell Carcinoma of the Lungs99

In this case, the p-value plot did not form a roughly 45-degree line,
with uniform p-value distribution over the interval. Instead it formed
an almost horizontal line, with the vast majority of the results well below
p = 0.01. Only 3 out of 102 p-values were above p = 0.05. One outlying
p-value was just below 0.40, which reminds us that even where there is
a true strong relationship, a few studies may produce false negatives. Our
p-value plot provides evidence that the studies associating smoking and
lung cancer had discovered a true association.


Bilinear P-value Plots 


Our method also registers bilinear results (plots that divide into two lines).
In Figure 6, we plotted studies that analyze associations between fine
particulate matter and the risk of preterm birth or term low birth weight.
A 45-degree line as in Figures 3 and 4 indicates randomness, no effect, and
therefore strongly suggests that researchers have indulged in HARKing if
they claim a positive effect. A bilinear shape instead suggests the
possibility of publication bias, p-hacking, and/or HARKing, although there
remains some possibility of a true effect.


Figure 6: P-value Plot, 23 Studies, Association of Fine Particulate
Matter (PM2.5) and the Risk of Preterm Birth or Term Low Birth Weight100

As we shall explain, such a bilinear plot should usually be interpreted
as providing evidence that the biases described above have affected a given
field, albeit not as strongly as a 45-degree line provides evidence of no effect.
Still, researchers would have good cause to query a claim of an
association between fine particulate matter and the risk of preterm birth or
term low birth weight, even if a true effect cannot be absolutely ruled out.

Figure 5 demonstrates that our method can detect true associations— 
it will not come back with a 45-degree line no matter what data you feed 
into it. When it does detect randomness, as in Figures 3 and 4, the inference
is that a particular result is likely to be random, and that the claimed
result has failed a statistical test that a true positive body of research passes. 

When a p-value plot exhibits bilinearity, as in Figure 6, that provides 
evidence that there are 1) missing p-values—missing results, which ought 
to complete the (null) line; and/or 2) p-hacked results, which have driven 
results down from what they should be to results smaller than the profes- 
sionally designated level of statistical significance. Bilinearity, in other 
words, provides evidence that a field has been subject to publication 
bias—either that negative results have gone into the file drawer or that 
published results are the result of p-hacking, and/or HARKing. 
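
The file-drawer mechanism can be imitated with a small simulation. Everything in the sketch below is a hypothetical setup chosen for illustration: 200 studies of a question with no true effect, a rule that every result below 0.05 is published while only about one in five of the others is, and an arbitrary seed.

```python
import random

random.seed(7)  # arbitrary seed for repeatability (an assumption)

# 200 hypothetical studies of a question with no true effect:
# their p-values are uniform on [0, 1).
all_results = [random.random() for _ in range(200)]

# Assumed publication rule: every p < 0.05 is published, but only
# about one in five of the remaining results leaves the file drawer.
published = [p for p in all_results if p < 0.05 or random.random() < 0.2]


def share_significant(p_values):
    """Fraction of p-values below 0.05 (0.0 for an empty list)."""
    if not p_values:
        return 0.0
    return sum(p < 0.05 for p in p_values) / len(p_values)


print(share_significant(all_results))  # near 0.05 before selection
print(share_significant(published))    # inflated after selection

ranked = sorted(published)  # rank-ordered values for a p-value plot
```

In the ranked list, the small p-values are crowded together at the low ranks while the surviving larger values are spread thinly over the rest of the interval, which is the two-segment pattern described above.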

Our test is useful for assessing the scientific literature precisely
because it provides reasonable possibilities for both success and
failure.101 We should emphasize that this method is not meant to present an
unanswerable disproof of any study or literature to which it is applied.
As noted above, the authors of the claim associating maternal breakfast
cereal consumption with altered fetal sex ratios made a
counter-argument to our critique, and to the argument for randomness displayed in
Figure 3. We urge all scholars and interested citizens to examine these
counter-arguments. Scientific discovery proceeds by the scrutiny of such
arguments and counter-arguments.102


101 Mayo (2018).
102 Mathews (2009).

We claim that our p-value plot method provides a useful test to check
claims against the null hypothesis. Any such claims ought as a general
rule to survive the test of our method—particularly if they are to be used 
to influence government policy. 

P-value plots are an essential component of the rigorous statistical
testing that must now be considered the scientific gold standard. Even
meta-analyses exclusively relying on studies of RCTs, which use
admirably rigorous study designs,103 can display bilinear p-value plots. P-value
plotting provides evidence that while RCT studies may be necessary to
produce rigorous science, they are not sufficient unless they have been
subjected to equally rigorous statistical testing. (For three additional
examples of p-value plots that register bilinear results, see Appendix 3:
Common Examples of Bilinear P-value Plot Behavior.)


Meta-Analyses: Definition and Use 


Now that we have explained how p-value plotting works, we define
what a meta-analysis is and explain how meta-analyses and p-value plots
are used together to evaluate
the reliability of a claim. A meta-analysis is a systematic procedure
for statistically combining data from multiple published papers that 
address a common research question—for example, whether a specific 
factor is a likely cause or origin of a health outcome such as a stroke or 
a heart attack. Scientists can conduct meta-analyses relatively easily. 
Researchers use computer programs to search the published literature, 
sort quickly through titles, abstracts, and full-texts of papers, and select 
ca. 10-20 papers from the hundreds to thousands of papers initially iden- 
tified as candidates for meta-analysis. 
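
To make the idea of statistically combining results concrete, the following is a minimal sketch of one common pooling approach: fixed-effect, inverse-variance weighting of risk ratios on the log scale. It is an illustration under stated assumptions, not the procedure of any meta-analysis discussed in this report. The function name `pool_fixed_effect`, the assumption that all inputs carry 95% confidence limits, and the example inputs are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def pool_fixed_effect(risk_ratios, ci_lowers, ci_uppers):
    """Pool risk ratios by inverse-variance weighting on the log scale.

    Assumes each study reports a risk ratio with 95% confidence limits.
    Returns (pooled_rr, pooled_lower, pooled_upper).
    """
    weights = []
    weighted_logs = []
    for rr, lower, upper in zip(risk_ratios, ci_lowers, ci_uppers):
        # Standard error of log(RR), recovered from the width of the CI.
        se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
        weight = 1.0 / se ** 2
        weights.append(weight)
        weighted_logs.append(weight * math.log(rr))
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - Z_95 * pooled_se),
        math.exp(pooled_log + Z_95 * pooled_se),
    )


# Hypothetical inputs: two identical studies, each RR = 1.2 (0.9 to 1.6).
print(pool_fixed_effect([1.2, 1.2], [0.9, 0.9], [1.6, 1.6]))
```

Pooling narrows the confidence interval even when every input study is individually inconclusive, which is one reason the reliability of the base studies matters so much.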

The set of papers chosen for a single meta-analysis itself requires
careful study so as to select properly comparable and on-topic papers
and include all the relevant studies. In the well-established cottage
industry of meta-analysis studies, a skilled team of 5-15 researchers can
turn out one meta-analysis per week. Researchers publish approximately
5,000 meta-analysis studies per year.

103 Grossman (2005).
104 Chen (2013); Glass (1976); Stroup (2000).

Many government agencies now depend upon meta-analyses. The 
flood of papers on any given topic makes it difficult even for an expert to 
stay abreast of all the literature, and a meta-analysis provides a conve- 
nient way to digest the results of many individual papers. Government 
agencies also wish to base their policy on a broad spectrum of rigorous, 
comparable research, rather than just one or a few individual studies. 
Meta-analyses offer the promise that government agencies are indeed 
using such research. Meta-analyses also offer what appears to be an 
impartial protocol that can provide a safeguard against the danger of 
biased expert judgment. 

Yet meta-analyses have not proved to be a cure-all. Meta-analyses
can themselves be affected by publication bias, and by almost every other
form of irreproducibility-crisis research error that affects individual
studies. For example, when researchers vary meta-analyses’ inclusion
and exclusion criteria (the criteria stating which studies to include in
a meta-analysis and which to exclude), they can produce wildly varying
results. In other words, researchers who do not pre-register their
inclusion and exclusion criteria can HARK their meta-analyses.

Meta-analyses’ reliability also depends on their base studies’ reliability.
If those base studies have been affected by publication bias or other
infirmities (e.g., failure to apply MTMM to control for experiment-wise
error), then the meta-analyses built on them are no more than Garbage In,
Garbage Out (GIGO). Funding bias can also affect meta-analyses, and where
government agencies are concerned, it is worth emphasizing that government
funding can produce substantial funding bias.


105 De Vrieze (2018). 

106 Ioannidis (2016).

107 Rothstein (2005); Thornton (2000). 
108 Palpacuer (2019). 

109 Cecil (1985); Wojick (2015).


Meta-Analyses: Evaluation


Qualitative study of meta-analyses is a burgeoning field, which should
repay further development. We will focus here, however, on the quantitative,
statistical study of meta-analyses’ validity, an approach made
possible by the extraordinary growth in the number of meta-analyses.

When we refer to a research ‘claim’ in our discussion below, we mean
that a study claims a positive association between a factor
investigated and an outcome, based on finding small p-values (less than
0.05) in its research. As it is a statistical claim being made by the
meta-analysis researchers, we can evaluate the reliability of the claim
from a statistical point of view. We can use p-value plotting to evaluate
published meta-analyses, as we did in Figures 3-6, and thereby uncover
problems in the way these meta-analyses have been interpreted.

When we plot an approximately 45-degree line, we acquire good 
evidence for the null hypothesis. When we plot bilinearity, we acquire 
evidence of publication bias, p-hacking, and/or HARKing—and signif- 
icant evidence against any claim of a consistent overall positive asso- 
ciation between cause and outcome across the studies used in that 
particular meta-analysis. At the very least, we have acquired evidence 
that some unidentified covariate complicates the putative relationship. 
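
The two patterns described above can be sketched numerically. The following is a minimal illustration, not the plotting code used for the figures in this report. It rank-orders a set of p-values and measures how far they fall from the diagonal expected under the null hypothesis, where the i-th smallest of n uniform p-values is expected near i/(n+1). The function names and both example data sets are hypothetical.

```python
def p_value_plot_points(p_values):
    """Return (rank, p-value) pairs with p-values sorted from smallest to largest."""
    return list(enumerate(sorted(p_values), start=1))


def max_gap_from_diagonal(p_values):
    """Largest distance between a ranked p-value and its null expectation i/(n+1)."""
    n = len(p_values)
    return max(abs(p - rank / (n + 1)) for rank, p in p_value_plot_points(p_values))


# Hypothetical data: p-values spread evenly, as expected when nothing is going on.
evenly_spread = [i / 11 for i in range(1, 11)]

# Hypothetical data: a cluster of very small p-values plus a rising tail.
two_component = [0.001] * 5 + [i / 6 for i in range(1, 6)]

print(max_gap_from_diagonal(evenly_spread))
print(max_gap_from_diagonal(two_component))
```

The evenly spread set hugs the diagonal, while the two-component set departs from it sharply at the point where the flat cluster of small p-values gives way to the rising tail.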

We noted above that government agencies rely heavily on meta-analy- 
ses to justify regulation. They do not as yet subject these meta-analyses to 
p-value plotting, and we believe that their failure to do so denies them a
very useful tool for assessing the validity of such meta-analyses. P-value 
plotting that establishes bilinearity does not disprove the meta-analysis. 
The significant associations could be true; the random results in error. 
But given the known incentives toward publication bias, p-hacking, and 
HARKing, bilinearity says we should take meta-analyses’ claims to have 
detected positive associations with a big grain of salt. 

Where government regulatory policy depends on the claim that such 
positive associations exist, the existence of a bilinear p-value plot provides
a very strong argument that a body of literature has not actually proved the
existence of an association to the level that justifies government regulation. A 
bilinear p-value plot provides a good rule of thumb: a government agency 
has not yet acquired the rigorously tested body of scientific research 
needed to justify regulation. 

P-value plotting isn’t itself a cure-all. The procedure might not be 
able to tell when an entire literature consists of biased results. P-value plot- 
ting cannot detect every form of systematic error. But it is a useful tool, 
which allows us to detect a strong likelihood that a substantial portion of 
government regulation has been built on inconsistent science. 

We note here that p-value plotting is not the only means available by 
which to detect publication bias, p-hacking, and HARKing in meta-anal- 
yses. Scientists have come up with a broad variety of statistical tests to 
account for such frailties in base studies as they compute meta-analyses. 
Unfortunately, publication bias and questionable research procedures in 
base studies severely degrade the utility of existing means of detection.
We proffer p-value plotting not as the first means to detect publication
bias and p-hacking in meta-analyses, but as a better means than
alternatives which have proven ineffective.


The EPA and PM2.5


A general concern with air quality in America emerged in the
late 1940s and early 1950s, precipitated by chronic smog in
Los Angeles, the freak but deadly combination of weather and
airborne emissions in 1948 at Donora, Pennsylvania, and London’s Great
Smog of 1952. These three incidents, and the newly revived recollection
of a similarly deadly freak combination of weather and airborne emissions
in 1930 in the Meuse Valley in Belgium, sparked increasing concern
with a range of airborne components, such as oxides of sulfur, nitric and
nitrous oxides (i.e., NO and NO2), and particulate matter, independent of
whether they contributed to visually perceptible smoke. Smog became a
household word. The federal government’s Public Health Service turned
its attention to the general effects of airborne components on health.

A series of local and state regulatory initiatives, particularly in Los
Angeles and California, led to the establishment of a federal regulatory
structure in 1963, the Clean Air Act. This was followed by further
federal measures, notably the Motor Vehicle Control Act (1965), the
Clean Air Act Amendments (1966), the Air Quality Act (1967), the Clean
Air Act Amendments (1970), and the establishment of the Environmental
Protection Agency (1970). These collectively, but particularly the last two,
set the ground rules for the regulatory structure that has persisted to the
present day.

The EPA must develop air quality criteria for specific airborne compo- 
nents, informed by expert opinion, and describe their effects singly and 
in combination on the health and welfare of American citizens. The EPA 
must set National Ambient Air Quality Standards (NAAQS), as a yardstick
by which states and localities can measure their own air quality, and as a
legal requirement to enforce reduction of airborne components.

These NAAQS are for carbon monoxide (CO), hydrocarbons (HC), lead
(Pb), nitric oxides (NOx), ozone (O3), particulate matter (PM), and oxides
of sulfur (SOx). The EPA must also review the NAAQS periodically, to determine
whether further regulation is in order. The EPA acts in coordination
with different federal government departments, such as the Office of
Management and Budget (OMB) and the Council on Environmental
Quality (CEQ), but it plays the leading role.


Costs 


The costs of insufficiently substantiated regulation can become exorbitant. As
a recent example, consider the estimated costs of requiring ships to use “cleaner fuel”
with less sulfur, so as to reduce SO2 emissions. The EPA argues that the move to
low-sulfur ship fuel could save up to 14,000 American and Canadian lives every year.
The inferred health-related benefits are estimated to be as much as US$110 billion/
year in 2020. The EU claims these regulations will prevent 50,000 premature deaths.
On the other hand, the cost of these regulations is estimated at US$3.2 billion/year
in 2020 and may rise to a total of one trillion dollars through the year 2050.

Yet a growing body of research fails to support the EPA and the EU’s mortality
claims. This research provides evidence that SO2 in ambient air has no significant
association with mortality, heart attacks, asthma, or lung cancer. We may
be paying up to $1 trillion to satisfy a regulation with no real scientific foundation.


The EPA issues an extraordinary number of regulations, which affect every area of 
the economy and constrict everyday freedoms. If the long-term cost of one regula- 
tion on one industry amounts to one trillion dollars, the cost of many regulations on 
every industry is uncountable trillions. The EPA should only impose such costly reg- 
ulations using fully reproducible science that has survived a battery of severe tests. 


The EPA slowly imposed increasingly restrictive regulations and
regularly updated NAAQS (Figure 7). These required the accumulation
of data on both air quality and health effects, an effort advanced by the
EPA’s sponsorship of research that would underpin the emerging regulations.
The EPA only shifted from regulation of Total Suspended Particles
(TSP) to PM10 in 1987. It did not regulate PM2.5 explicitly until 1997. The EPA
is far older than its current regulatory regime for particulate matter.

122 Acharjee (in preparation); Young (in preparation).
123 Bachmann (2007); Cao (2013).


Figure 7: Summary of particulate matter (PM) National Ambient Air
Quality Standards (NAAQS) implemented by the U.S. Environmental
Protection Agency (U.S. EPA)

The current regulations depend on statistical analysis. The EPA and
environmental epidemiologists, as a discipline, have not established
direct causal biological mechanisms that link air components and health
outcomes, save for freak conditions such as prevailed in the Meuse
Valley (1930), Donora (1948), and London (1952).

Rather, they have relied on statistical analyses to discern significant 
associations between air components and health outcomes. These associ- 
ations provide the “proof” that an air component, alone or in association 
with other elements, causes damage to health and to the economy. The 
debate about whether or not the EPA should make a particular regula- 
tory decision raises questions central to the irreproducibility crisis— 
data accuracy, research protocols, statistical analyses, publication bias, 
sponsorship bias, etc. 

We note here a conundrum. By an extraordinary number of indicators,
Americans’ general health has improved remarkably over the last
several generations. The gravest recent harm to Americans’ life expectancy
has been the opioid epidemic, concentrated among poor white
Americans, an effect entirely unrelated to the remit of the EPA. Yet
the EPA produces an ever-lengthening catalogue of studies of things
that harm Americans’ health. Feinstein charged as far back as 1988
that much of the research suggesting specific health harms must be the
result of misuse of statistics and computers: data dredging to produce a
dire literature of statistically significant effects that square badly with
evidence of general improvements in health and life expectancy.


Who Benefits?


The EPA appears to have acted selectively in its approach to the health effects
of PM2.5. In the early 1990s two different sets of researchers examined the
association between PM2.5 and mortality. Dockery et al. published a study that appeared in
the New England Journal of Medicine in 1993 that found a significant association
between small particulate matter in outdoor air and death rates; the Dockery
study received apparent support from the study by his colleagues in Pope et al.
in 1995. Yet Styer et al. published a study in 1995 that drew upon a far larger
data set and found no association between particulate matter in outdoor air and
deaths.


The EPA had funded both Dockery and Styer, but at that point the EPA ceased
funding the Styer group and began intensive support of the Dockery-Pope line of
research, despite substantial skepticism from the scientific community about the
accuracy of the Dockery-Pope data. The Health Effects Institute (HEI), half-funded
by the EPA, then re-analyzed the data and supported the claims made in the
Dockery and Pope studies, but without ever making their data set available to the
public or to the EPA. HEI neither requested nor examined the Styer data set.


The EPA has continued to pay more attention to research that supports regulation. 
In 2009, the EPA’s Integrated Science Assessment for Particulate Matter repeatedly
cited meta-analyses that supported EPA regulatory policy, while failing to cite key
negative papers in the literature or citing them only perfunctorily, by way of brusque and
insufficiently justified dismissal.


More narrowly, the EPA constructed its PM2.5 regulation from 1997 to
the present day upon a series of studies in the generation from the 1970s
to the 1990s that sought to establish: 1) significant associations between
PM2.5 and various health effects; and 2) that the health effects were themselves
substantial enough to justify EPA regulation. The regulation
from 1997 onward relies on research drawn from the famous Harvard
Six Cities and American Cancer Society (ACS) studies, whose original
data, on claimed grounds of privacy and confidentiality, have never been
made transparently available to other researchers for reproduction or
critique.

131 Dockery (1993); Pope (1995).
132 Styer (1995).
133 Milloy (2016); Moolgavkar (1995); Phalen (2004).
134 Krewski (2000); Milloy (2016).
135 Krewski (2000).
136 Chay (2003); Enstrom (2005); ISAPM (2009); Janes (2007); Styer (1995); Young (2018b).
137 Cao (2013).
138 Dockery (1993); Pope (1995); and note especially the critique in Enstrom (2017).

We may note here that data for environmental epidemiology is difficult
to collect. It is an observational science rather than a laboratory
one, and one that requires data sets of hundreds of thousands of individuals,
sometimes collected over decades, to make any sort of definitive
statement. The EPA delayed more rigorous regulation of PM2.5 for a
generation precisely so as to assemble a data set that they thought would
justify such regulation. The EPA and its advocates argue that the difficulty
of collecting such data justifies allowing the EPA to base regulation
on inaccessible data.

However, it is precisely because the data are so difficult to collect that 
it is vital to have access to the one available data set, so that it may be 
subjected to a battery of rigorous (severe) tests to see if the analysis is 
sound. The burden of proof for transparency and reproducibility lies 
with the research the EPA uses as the basis for its regulations. 

When EPA regulation is based on inaccessible data, there are numerous
potential weaknesses. We cannot fully account for interaction
effects: the effects of “confounding” variables on health effects, such as
temperature, atmospheric inversion, or varying demographic predispositions
to sickness and mortality. We cannot examine the base
information itself for reliability. Death certificates, for example, are not
entirely reliable sources of information.

Neither do we possess the data that can begin to allow us to determine
what are the precise causal mechanisms (the biological mechanisms) by
which an airborne component actually induces a health risk. (For more
discussion about the current status of unanswered PM2.5-mortality causal
mechanisms and several negative studies that invalidate PM2.5-mortality
causation, see Appendix 4: PM2.5-Mortality Causality – Incomplete
Evidence. This evidence, drawn from published literature, does not
support a PM2.5-mortality causal mechanism.)

The Harvard Six Cities/ACS data that underpin the Dockery/Pope
research have never been subjected to Multiple Testing and Multiple
Modeling (MTMM), even though an adjustment for MTMM with the widely-used
SAS statistical software could easily be applied to the data.
Since Dockery and Pope never made the data publicly available, no independent,
critical researcher can subject the Harvard Six Cities/ACS data
to the severe test of MTMM. Since analysis of newer and much larger data
sets has found no effects of air quality on mortality, skepticism about the
Dockery/Pope results is warranted.

We would prefer to analyze the EPA’s PM2.5 policy by subjecting the
data underlying the Harvard Six Cities/ACS studies to further scrutiny.
Unfortunately, such scrutiny is impossible because the data’s owners
have barred public access on the claimed grounds of privacy and confidentiality.
Yet p-value plotting provides a way to apply a severe test to
results where the data remain hidden, and to assess whether publication
bias, p-hacking, or HARKing has produced the body of literature that
“justified” the EPA’s PM2.5 regulation.


Evaluation of PM2.5 Research Underlying EPA Regulation


Introduction


Environmental epidemiological researchers regularly engage in
massive numbers of hypothesis tests without making Multiple Testing and
Multiple Modeling (MTMM) statistical corrections. These tests
have associated air quality components with a remarkable number of
possible adverse health effects.

These effects include but are not limited to: all-cause mortality; 
cause-specific mortality; all-cause morbidity; low birth weight; miscar- 
riage; COPD exacerbation; inflammation; pulmonary complication; 
autism; obesity; depression; atopic dermatitis; impaired vestibular 
function (sense of balance); metabolic disorders; suicide, mental health 
and well-being; ADHD (Attention Deficit/Hyperactivity Disorder); 
respiratory complication; pneumonia and acute respiratory infection; 
reproductive outcomes; high blood pressure; lung and other cancers; and 
accelerated brain aging.

Below we summarize four technical investigations about associations
of fine particulate matter (PM2.5) (and in some cases other air quality
components) in ambient air with various health effects. These effects
include all-cause mortality, heart attacks and two asthma effects:
development of asthma and asthma attacks. The investigation on heart
attacks using p-value plots has already been published. Preprints of
the two forthcoming investigations on asthma have been deposited in the
open-access repository arXiv, as has the preprint of a shorter investigation
on all-cause mortality.

145 Samet (2019).

These investigations, whose research protocol we pre-registered,
have been or will be submitted to professional journals for peer review
and publication. Data used in these studies are publicly available. We
approached these investigations by focusing on meta-analyses that
ask the specific question whether inferred exposure to PM2.5 (and other air
quality components) is associated with increases in all-cause mortality, heart
attacks and asthma. We present strong statistical evidence that the EPA
has developed policy and regulated PM2.5 based upon a field of epidemiology
research substantially affected by some combination of sampling
bias, publication bias, p-hacking and/or HARKing.

Our research also demonstrates more generally how p-value plots 
may be used to evaluate the reliability of studies making research claims 
about any air quality component-health outcome associations. Here we
present counts and p-value plots for these investigations and then we
interpret and discuss them. (For supporting information for these investigations,
see Appendix 5: Supporting Information for Investigations of
PM2.5 Health Effect Association. This information includes an explanation
of how these counts were made.)


P-Value Plots 


All-cause Mortality 


The very multiplicity of claimed associations between air components 
and adverse health effects suggests that even larger numbers of possible
hypothesis tests lie behind claims of air quality component associations
with all-cause mortality. Claims associating air quality components
with all-cause mortality require more than usual care when making 
MTMM corrections. 
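
As a rough illustration of what a multiple-testing correction does, the sketch below applies two standard adjustments, Bonferroni and Holm, to a small set of p-values. These are generic textbook procedures chosen for illustration; they are not claimed to be the specific MTMM adjustment referred to in this report, and the function names and input values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure; returns reject flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # Compare the (step+1)-th smallest p-value with alpha / (m - step).
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values from four tests in one study.
example = [0.001, 0.015, 0.04, 0.30]
print(bonferroni(example))
print(holm(example))
```

Without any correction, three of the four hypothetical results would be called significant at 0.05; after correction, fewer survive. The more tests a study runs, the stricter the per-test threshold becomes.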

The standard six air quality components that environmental epidemiology
researchers investigate for associations with all-cause mortality
are CO (carbon monoxide), NO2 (nitrogen dioxide), O3 (ozone), PM2.5
(particulate matter, small), PM10 (particulate matter, not as small) and
SO2 (sulfur dioxide). Our investigation of the reliability of data from
studies used in a meta-analysis of short-term air quality and all-cause
mortality associations focused on NO2, O3, PM2.5, and PM10 as “causes”, and
all-cause and cause-specific mortalities as “outcomes.”

150 Samet (2019).

In this large-scale systematic review and meta-analysis, researchers
reviewed 1,632 papers and selected 196 for analysis. The researchers
claimed that, “This study found evidence of a positive association between
short-term exposure to PM10, PM2.5, NO2, and O3 and all-cause mortality, and
between PM10 and PM2.5 and cardiovascular, respiratory and cerebrovascular
mortality.” The researchers provided risk ratios with confidence limits
and from these we produced a p-value plot (Figure 8).
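
One standard way to recover a p-value from a reported risk ratio and its confidence limits is sketched below. It assumes 95% limits and a normal approximation on the log scale; whether this matches the exact computation behind Figure 8 is an assumption, and the function name and the example risk ratios are hypothetical rather than drawn from any study.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def p_value_from_risk_ratio(rr, ci_lower, ci_upper):
    """Two-sided p-value for the null hypothesis RR = 1.

    Assumes (ci_lower, ci_upper) are 95% confidence limits and that
    log(RR) is approximately normally distributed.
    """
    se_log = (math.log(ci_upper) - math.log(ci_lower)) / (2 * Z_95)
    z = math.log(rr) / se_log
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical risk ratios with 95% confidence limits.
studies = [(1.02, 0.98, 1.06), (1.10, 1.03, 1.17), (0.99, 0.95, 1.03)]
p_values = sorted(p_value_from_risk_ratio(*s) for s in studies)
for rank, p in enumerate(p_values, start=1):
    print(rank, round(p, 4))
```

Sorting the resulting p-values and plotting them against their rank gives the p-value plot.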


Figure 8: P-value plot, All-Cause Mortality and PM2.5


Heart Attacks


Much environmental epidemiology literature claims that poor
air quality can trigger a heart attack. The standard six air quality
components that environmental epidemiology researchers identify as
heart attack triggers are CO, NO2, O3, PM10, PM2.5 and SO2. We recently
published an investigation of the reliability of data from studies used in
a meta-analysis of short-term air quality and heart attack associations.

151 Orellano (2020); Young (submitted).

The meta-analysis, which identified and drew data from 34 studies
that statistically examined associations between the air quality components
and heart attacks, claimed that, “All the main air pollutants, with
the exception of ozone, were significantly associated with a near-term increase
in MI [myocardial infarction, aka heart attack] risk.”

We counted the number of outcomes, predictors, covariates and time
lags used in each of the 34 studies in order to estimate the number
of statistical tests performed. Summary statistics for these counts are
shown in Figure 9. We also developed p-value plots for the six air quality
components (shown in Figure 10).


Figure 9: Summary statistics, Analysis search space (number of statistical
tests), 34 studies, associations between air quality components
and heart attacks


Statistic         Space 1    Space 2      Space 3
minimum                 8          8          240
lower quartile         23         64        2,496
median                 36        256       12,288
upper quartile        109      1,024       58,368
maximum               540     16,384    4,587,520


In Figure 9, minimum = minimum count; lower quartile = 25th percentile count;
median = 50th percentile count; upper quartile = 75th percentile count; maximum =
maximum count; Space 1 = Outcomes x Predictors x Lags; Space 2 = 2^Covariates; Space
3 (analysis search space or number of statistical tests) = Space 1 x Space 2.
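
The search space definitions in the note to Figure 9 can be written out as a short calculation. This is a minimal sketch of those definitions only; the function name and the example counts are hypothetical and do not correspond to any particular study in the table.

```python
def search_space(outcomes, predictors, covariates, lags=1):
    """Return (Space 1, Space 2, Space 3) for one study.

    Space 1 = outcomes x predictors x lags (questions asked).
    Space 2 = 2 ** covariates (each covariate in or out of the model).
    Space 3 = Space 1 x Space 2 (estimated number of statistical tests).
    """
    space1 = outcomes * predictors * lags
    space2 = 2 ** covariates
    return space1, space2, space1 * space2


# Hypothetical study: 2 outcomes, 6 air components, 3 lags, 8 covariates.
print(search_space(outcomes=2, predictors=6, covariates=8, lags=3))
```

Because Space 2 doubles with every added covariate, even a modest study design can imply thousands of possible tests. For cohort studies without time lags, leaving `lags` at its default of 1 gives Space 1 = Outcomes x Predictors.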


154 The original meta-analysis is Mustafic (2012); our critique is Young (2019b). 

155 Young (2019b). We catalogued the 34 studies in the supplemental information to Young (2019b). 

156 Time lags are an analytical category specific to environmental epidemiology time series studies. A lag assumes
that an air component may be associated with an adverse health effect some number of days after an
exposure event. For example, a PM exposure event occurring five days earlier might induce a heart attack
today.


Figure 10: P-value plots, Six air quality components, Air quality-heart
attack meta-analysis


Development of Asthma (Cohort Studies)


We investigated a meta-analysis examining the association of ambient
exposure to NO2 and PM2.5 early in life with development of asthma later
in life. The meta-analysis drew data from 19 published studies involving
18 different cohort populations: 13 cohorts for NO2 and 5 cohorts for PM2.5.
The meta-analysis claimed that, “The results are consistent with an effect of
outdoor air pollution on asthma incidence.”

We counted the number of outcomes, predictors and covariates available
in each of the 19 studies in order to estimate the number of statistical
tests performed. Summary statistics for these counts are shown in Figure
11. We developed a combined p-value plot for NO2 and PM2.5 (Figure 12).


157 Anderson (2013); Young (in progress). The 19 studies are listed in Anderson (2013). 




Figure 11: Summary statistics, Analysis search space (number of
statistical tests), 19 studies, associations between NO2, PM2.5 and
development of asthma


Statistic         Space 1    Space 2       Space 3
minimum                 2         32            96
lower quartile         15         96         1,536
median                 24        256        13,824
upper quartile         84      3,072       221,184
maximum               160    262,144    42,000,000


In Figure 11, minimum = minimum count; lower quartile = 25th percentile count;
median = 50th percentile count; upper quartile = 75th percentile count; maximum
= maximum count; Space 1 = Outcomes x Predictors; Space 2 = 2^Covariates; Space 3
(analysis search space or number of statistical tests) = Space 1 x Space 2.


Figure 12: 18 Cohort Studies, P-Value Plot
Note: solid circles (●) are NO2 p-values; open circles (○) are PM2.5 p-values.


Asthma Attacks (Time-Series Studies)


Much environmental epidemiology literature claims that poor air
quality can trigger asthma attacks. We investigated a meta-analysis
examining the association of short-term ambient exposure to six air
components (CO, NO2, O3, PM2.5, PM10, and SO2) with asthma attacks.


The meta-analysis drew data from 87 time-series studies that statistically
examined associations among air quality components and asthma
attacks, and claimed that “Short-term exposures to air pollutants account
for increased risks of asthma [attack]-related emergency room visits and hospi- 
talizations that constitute a considerable healthcare utilization and socioeco- 
nomic burden.” 

We counted the number of outcomes, predictors, covariates and time
lags available in 17 studies (about 20%) randomly selected from the list of 87
in order to estimate the number of statistical tests performed. Summary
statistics for these counts are shown in Figure 13. We made p-value plots
for the six air quality components (Figure 14).


Figure 13: Summary statistics, Analysis search space (number of statistical
tests), 17 randomly selected studies, associations between air
quality components and asthma attack


Statistic         Space 1    Space 2        Space 3
minimum                 6          4             96
lower quartile         60         16    [illegible]
median                160         32         15,360
upper quartile        288        256         40,960
maximum             5,120        512         89,600


In Figure 13, minimum = minimum count; lower quartile = 25th percentile count;
median = 50th percentile count; upper quartile = 75th percentile count; maximum =
maximum count; Space 1 = Outcomes x Predictors x Lags; Space 2 = 2^Covariates; Space
3 (analysis search space or number of statistical tests) = Space 1 x Space 2.


159 Bowatte (2015); Favarato (2014); Mehta (2013); Sheehan (2016); Takenoue (2012); Weinmayr (2010).


Figure 14: P-value plots, Six air quality components, Air quality-asthma attack
meta-analysis


Discussion


Counts


We provided search space counts as estimates of the number of statis- 
tical tests performed in base studies for the three air component-health 
effect meta-analyses we investigated. These include: Figure 9 (heart 
attack), Figure 11 (development of asthma) and Figure 13 (asthma attack). 
There is known flexibility available to researchers to undertake a range
of statistical tests and use different statistical models during observational
studies, and then to select, use, and report only a portion of the
test and model results. Wicherts refers to this flexibility as “researcher
degrees of freedom” in the psychological sciences.

161 Young (2019b).
162 Wicherts (2016).

Base papers with large search space counts suggest the use of large
numbers of statistical tests and statistical models and the potential
for researchers to search through and report only a portion of their
results (i.e., positive, statistically significant results). The
Space 3 counts in Figures 9, 11, and 13 show that large numbers of statistical
tests are a common feature of base studies used in these meta-analyses.
For example, estimates of the median number of statistical tests
performed in the meta-analyses’ base papers are 12,288 (34 base papers
for heart attack), 13,824 (19 base papers for development of asthma) and
15,360 (17 base papers for asthma attack).

Overall, these three meta-analysis investigations involved estimating 
search space counts from 70 environmental epidemiology base papers 
investigating associations between air quality components and health 
effects. The estimated median number of statistical tests performed for 
these 70 base papers is 13,056. Given a study with 13,056 statistical tests, 
we may expect to see as many as 0.05 x 13,056 = 653 results with p-values
less than 0.05 due to chance alone. Another noteworthy feature of these
base studies is that large numbers of statistical test results may have
gone unreported in these studies: presumably, results with p-values
greater than 0.05.
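
The arithmetic behind the chance-alone figure can be reproduced in a few lines. The sketch assumes that every null hypothesis is true and, for the second function, that the tests are independent; both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_chance_findings(n_tests, alpha=0.05):
    """Expected number of p-values below alpha when every null hypothesis is true."""
    return alpha * n_tests


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of at least one p-value below alpha, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n = 13056  # estimated median number of tests per base paper, from the text
print(round(expected_chance_findings(n)))
print(prob_at_least_one(n))
```

With this many tests, at least one nominally significant result is a near certainty even when no real association exists.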


P-value Plots 


We restate how the p-value plots were interpreted: 


A p-value plot that forms approximately a 45-degree line 
provides evidence of randomness—supporting the null 
hypothesis of no significant association. 

A p-value plot that forms approximately a line with slope 
< 1, where most of the p-values are small (less than 0.05), 
provides evidence for a real effect—supporting a statisti- 
cally significant association. 

A p-value plot that exhibits bilinearity (that divides into two
lines) provides evidence of publication bias, p-hacking,
and/or HARKing.
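A p-value plot of the kind interpreted above can be built by sorting the p-values and plotting each against its rank. The sketch below is illustrative only. The function names are hypothetical, the rank is normalized to the interval (0, 1] by our own choice so that a 45-degree line corresponds to a slope near 1, and the least-squares slope is a rough summary rather than a formal test. It does not by itself detect bilinearity, which we judge by inspecting the plot.

```python
from typing import List, Sequence, Tuple


def p_value_plot_points(p_values: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each p-value with its normalized rank (rank / n), smallest first."""
    ordered = sorted(p_values)
    n = len(ordered)
    return [((i + 1) / n, p) for i, p in enumerate(ordered)]


def least_squares_slope(points: Sequence[Tuple[float, float]]) -> float:
    """Slope of the ordinary least-squares line through the plotted points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Made-up p-values spread evenly over (0, 1], as pure chance would tend to give.
null_like = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(least_squares_slope(p_value_plot_points(null_like)))  # close to 1

# Made-up p-values that are all small, as a consistent real effect would give.
effect_like = [0.001, 0.002, 0.01, 0.02, 0.03]
print(least_squares_slope(p_value_plot_points(effect_like)))  # well below 1
```

The points returned by `p_value_plot_points` can be passed to any plotting tool; a bilinear plot would show a flat run of small p-values followed by a rising line.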

When we examine the specific p-value plots that we created (Figures
8, 10, 12, and 14), the patterns strikingly resemble the bilinear patterns
depicted in Figure 6 and the figures shown in Appendix 3: Common
Examples of Bilinear P-value Plot Behavior. The bilinear patterns indicate
heterogeneous sets of p-values, i.e., two-component mixtures.

These plots imply a mixture of random results and results plausibly 
due to publication bias, p-hacking, and/or HARKing. There is no consis- 
tent effect across studies that we would expect if the data supported 
a true positive association that represents an alternative hypothesis 
outcome (i.e., Figure 5). 

For example, the bilinear p-value plot for PM2.5 and all-cause mortality
(Figure 8) presents 29 p-values: 13 are less than 0.05 and 16 are
greater. The meta-analysis’ claim that “this study found evidence
of a positive association between short-term exposure to PM2.5 and all-cause
mortality” is not supported by convincing evidence. This result warrants
further scrutiny of the researchers’ other claims. For example, p-value
plots for PM10, NO2, and O3, not shown here, are also bilinear.

The six air component-heart attack p-value plots shown in Figure
10 resemble bilinear patterns. Likewise, bilinearity is a feature of the
NO2-asthma development p-value plot (Figure 12) and the six air
component-asthma attack p-value plots (Figure 14). Bilinear patterns
argue that any claim for a general implication of an air component as a
cause of heart attack, development of asthma, or asthma attack is
without statistical support. We believe that these investigations strengthen
our larger argument that all such claims of associations warrant more
skepticism than they have so far received.
Findings 


We do not claim that our results absolutely disprove claims that there are
associations between PM2.5 and all-cause mortality, heart attacks, or
asthma complications. Yet they are certainly consistent with the general
claim that false-positive results from publication bias, p-hacking, and/or
HARKing are common features of the biomedical literature today,
including the broad range of risk factor-chronic disease research.

Given the large numbers of statistical tests available in the
environmental epidemiology base studies used for the three meta-analyses
we investigated (Figures 9, 11, and 13), p-hacking certainly cannot be ruled
out as an explanation for the small p-values (less than 0.05) shown in
Figures 10, 12, and 14.

We do claim that air component-health effect associations ought to
survive a battery of severe but passable tests, such as the p-value plots
that we have undertaken here. This battery of tests should also include a
resampling-based multiplicity analysis (multiple testing and multiple
analysis) of the base studies used in a meta-analysis.

Researchers always have the burden of proof to demonstrate a significant
association, and government regulators have the higher barrier to prove
that an entire body of the best available science supports the
requirement to substitute regulation for liberty. It should disturb
scientists, bureaucrats, politicians, and citizens alike that current
governmental procedures to test for the best available science have failed
to pass our test, which is simple to execute and now widely used in a
variety of scientific disciplines.

The EPA might not have regulated PM2.5 at all if it had applied more
rigorous scientific reproducibility requirements to the research that it
used to justify its regulations. Certainly, the EPA policy and regulatory
process would have followed a substantially different course. If the
EPA were to apply more rigorous scientific reproducibility standards
going forward, including p-value plotting in its regular assessments
of scientific research, it would vastly improve the scientific reliability of
future PM2.5 regulation.

Conclusion


Overview


Many people ask, what proof is there that the irreproducibility
crisis has actually affected existing government regulations?
Our p-value plots provide a direct answer to that question.
Wherever we apply our p-value plots to a meta-analysis and produce a
bilinear relationship, we should presume that the questionable research
procedures, p-hacking and HARKing that constitute the irreproducibility
crisis have rendered the underlying research untrustworthy. We
have applied our p-value plots to research that provides the justification
for EPA regulation of PM2.5, and bilinear lines appeared. These bilinear
relationships provide strong evidence that the government has based
its regulations on unreliable research affected by the irreproducibility
crisis.

EPA regulations rely on environmental epidemiological literature
without applying rigorous tests for reproducibility, and without
considering the environmental epidemiology discipline’s general refusal to
take account of the need for Multiple Testing and Multiple Modeling. Such
rigorous tests are needed not least because earlier generations of
environmental epidemiologists have already identified the low-hanging fruit.

This low-hanging fruit includes massive statistical correlations between
risk factors and health outcomes, e.g., the connection between smoking and
lung cancer. Modern environmental epidemiologists habitually seek out small
but (nominally) significant risk factor and health outcome associations.
These practices render their research susceptible to registering false
positives as real results, and to mistaking an improperly controlled
covariable for a positive association.

Environmental epidemiologists are aware of these difficulties, but,
despite having remade their discipline into an exercise in applied
statistics, they do little to control for bias, p-hacking, and other
well-known statistical errors.¹⁶⁶ The intellectual leaders of their
discipline have positively counseled against taking measures to avoid these
pitfalls. But environmental epidemiologists, and the bureaucrats who depend
on their work to support regulations, proceed as a field with
self-confidence and with too little humility about just how much statistics
must remain an exercise in measuring uncertainty rather than establishing
certainty. Their results do not possess an adequate scientific foundation.
Their so-called “facts” are built on shifting sands, not on the solid rock
of real, transparent, and critically reviewed scientific inquiry.

Our study shows how one particular set of statistical techniques, 
simple counting and p-value plots, can provide a severe test of environ- 
mental epidemiology meta-analyses to detect p-hacking and other frail- 
ties in the underlying scholarly literature. We have used these techniques 
to demonstrate that meta-analyses associating PM2.5 and other air quality
components with mortality, heart attacks, and asthma attacks fail this
severe test. 

Our study also demonstrates negligence on the part of both
environmental epidemiologists and the EPA. The discipline of environmental
epidemiology has failed to adopt a simple statistical procedure to test
its research. The EPA has failed to require that research justifying
regulation be subjected to such a test. These persistent failures
undercut confidence in their professional capacities, as researchers and as
regulators.

These failures also suggest that, more broadly, the standard procedures
of environmental epidemiology are insufficiently rigorous, and that
current EPA regulatory policy, both in general and with regard to PM2.5
regulation in particular, fails to test with sufficient rigor the research
used to justify regulation. The EPA also makes comprehensive tests
impossible by failing to require public access to data sets used to justify
regulation. The EPA’s failure to use this particular testing procedure is
symptomatic of a larger failure to incorporate a full range of severe
statistical tests, without which the results of a statistically founded
discipline such as environmental epidemiology cannot qualify as the best
available science.

166 Clyde (2000); Westfall (1993); Young (2011); Young (2017).

Both environmental epidemiology as a discipline (including foun- 
dations, journals, and tenure committees) and the EPA ought to adopt 
a range of reforms to improve the reproducibility of their research. 
However, we direct our recommendations to the EPA, and more broadly 
to federal regulatory and granting agencies. 

We have reluctantly come to the conclusion that scientists will not 
change their practices unless the federal government credibly warns 
them that it will withhold government grant dollars until they adopt 
stringent reproducibility reforms. We have also come to the conclusion 
that federal regulators will not adopt stringent new tests of science 
underlying regulation unless they are explicitly required to do so. 

Yet while these recommendations are framed as suggestions for
government requirements, we still urge scientists to adopt these reforms
voluntarily.


Recommendations to the EPA 


All these recommendations are intended to bring EPA methodologies up to
the level of best available science, as per the mandate of the
Information Quality Act.


1. The EPA should adopt resampling methods (Multiple
Testing and Multiple Modeling) as part of its standard
battery of tests applied to environmental epidemiology
research.


We have critiqued at length the standard procedure of environmental
epidemiology meta-analysis, which has proven susceptible to statistical
frailties. The corollary of this critique is that the EPA should adopt the
standard procedure, elaborated in a work partly written by one of our
co-authors more than a quarter century ago, to control for
environmental epidemiology’s Multiple Testing and Multiple Modeling (MTMM)
problem.

This resampling-based multiple testing procedure already has been
incorporated into a variety of disciplines, including genomics and
economics, and has been shown to be optimal for a broad class of testing
problems. Any discipline using statistics can incorporate these
procedures into its regular tests. Any government agency that relies
on scientific research can require the use of such procedures to test 
scientific research, before it is used to justify regulation, or qualify as 
best available science. The EPA should do so. 

The EPA, in other words, should only rely on base studies and 
meta-analyses that use a resampling methodology (MTMM) to correct 
their results. The EPA should also subject all such research to indepen- 
dent MTMM analyses. 

MTMM analysis is not the only tool that can be used to adjust an
analysis for p-hacking and other forms of biased sampling. But we believe it
is a useful tool, which can easily be adopted by regulators and
researchers to apply a severe test to scientific research. We do not propose
it as a cure-all, but as a tool useful in itself, and also as an example
showing that a variety of reproducibility reforms can practicably be
introduced into the ordinary procedures of professional and governmental
judgment of scientific validity.


2. The EPA should rely for regulation exclusively on 
meta-analyses that use tests to take account of endemic 
questionable research procedures, p-hacking and 
HARKing. 


Questionable research procedures, p-hacking and HARKing are
endemic in environmental epidemiology, as they are in many disciplines
affected by the irreproducibility crisis. Since so many base studies are
unreliable, the meta-analyses which collate these base studies likewise
have become unreliable: Garbage In, Garbage Out.

When the EPA uses meta-analyses or a systematic review to justify
regulation, it should only rely on meta-analyses that conduct rigorous
tests to detect whether a field’s base studies have been affected by
questionable research procedures, p-hacking and HARKing. While we will
not prescribe further particular methods here, we state that existing
tests are not sufficient. The EPA should adopt tests substantially more
stringent than those it currently accepts.


3. The EPA should redo its assessment of base studies more
broadly to take account of endemic questionable research
procedures, p-hacking and HARKing.


The different aspects of the irreproducibility crisis—questionable 
research procedures, p-hacking and HARKing—thrive opportunistically 
within research structures that allow scientists arbitrary control over 
revealing their questions and their data. When we remove that control, 
we remove much of the possibility that the irreproducibility crisis will 
affect government regulation. 

The EPA should take the initiative generally to assess base studies
with an eye to rooting out questionable research procedures, p-hacking
and HARKing. The EPA can best remove that control by requiring
preregistration of research that justifies regulation and public access to
research data used to justify regulations.


4. The EPA should require preregistration and registered
reports of all research that informs regulation.


Preregistration and registered reports will constrain the ability of
scientists to HARK, and generally inhibit p-hacking and questionable
research procedures. Preregistration and registered reports are not
cures. Determined scientists in time undoubtedly will devise methods
to undermine the effectiveness of these precautions. But preregistration
and registered reports will substantially improve the reliability of
research used by the EPA. The EPA should stipulate that all
preregistrations and registered reports must detail the MTMM methods that
will be used to assess results.


5. The EPA should also require public access to all research
data used to justify regulations.


We have provided substantial corroborative evidence that the
irreproducibility crisis has affected research used to justify EPA
regulation; we cannot provide direct evidence because the EPA does not
permit or facilitate public access to that research data. The EPA should
require that all research used to justify regulation provide public access
to the underlying research data. The EPA should direct all necessary
funding to ensure de-identification of human data¹⁷⁵ and provide an adequate
means to address all privacy and confidentiality concerns. These are
challenges that the EPA can and must meet, not convenient obstacles that
prevent public access.¹⁷⁶


6. The EPA should consider the more radical reform of funding
data set building and data set analysis separately.


Researchers who combine data collection and data analysis face
a temptation to adjust the data to improve the results of their analyses. The
EPA should consider separating these two functions, so as to remove the
situation that presents this temptation. It should also consider combining
this reform with a requirement that researchers provide a hold-out data
set to a trusted third party before analysis, so that any analysis claim can
be tested independently using the hold-out data set.
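A hold-out arrangement of this kind could be set up along the following lines. This is a minimal sketch under our own assumptions: the function name `split_hold_out`, the 30% hold-out fraction, and the fixed random seed are all hypothetical choices for illustration, not requirements stated in this report.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_hold_out(
    records: Sequence[T], hold_out_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly split records into (analysis set, hold-out set).

    The hold-out set would be lodged with a third party before any analysis
    begins; the 30% default is an illustrative assumption.
    """
    if not 0.0 < hold_out_fraction < 1.0:
        raise ValueError("hold_out_fraction must be strictly between 0 and 1")
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded so the split can be audited
    n_hold = int(round(len(records) * hold_out_fraction))
    hold_idx = set(indices[:n_hold])
    analysis = [r for i, r in enumerate(records) if i not in hold_idx]
    hold_out = [r for i, r in enumerate(records) if i in hold_idx]
    return analysis, hold_out


# Hypothetical record identifiers standing in for de-identified study records.
analysis_set, hold_out_set = split_hold_out(list(range(100)))
print(len(analysis_set), len(hold_out_set))
```

Any claim developed on the analysis set can then be re-tested once, by the third party, on records the analyst has never seen.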

175 For a beginning, see Gal (2014); Kushida (2012).

176 Cecil and Griffin have noted how an agency can insulate its actions from public scrutiny by funding a grant
for controversial research and then basing its action on those findings. As long as the agency does not take
possession or control of the records, FOIA requests (or other procedures to facilitate public oversight) will
not assist those who wish to challenge the findings the agency relies on to justify its actions. Cecil (1985). The
requirement for public access to research data will also ensure that the EPA does not undertake maneuvers of
this nature.

7. The EPA should place greater weight on reproduced
research.


We have specified the use of improved statistical techniques to reduce 
the effects of the irreproducibility crisis in environmental epidemiol- 
ogy. But such statistical tests cannot catch every sort of questionable 
research procedure. Indeed, research that passes every statistical test 
might still be a false positive. The EPA therefore should increase the
weight it assigns to research that is not only reproducible but also
reproduced, and decrease the weight it assigns to research that has not yet
been reproduced.


8. The EPA should constrain the use of “weight of evidence” to
take account of the irreproducibility crisis.


The “weight of evidence” principle generally facilitates arbitrary
judgments as to what science should inform regulation. Self-interest will
inevitably incline scientists and regulators, consciously or unconsciously,
to weigh more heavily research that facilitates regulation. Groupthink
redoubles the effects of consensus-thinking, which too easily discards
research that fails to endorse the consensus.

Wherever possible, the EPA should substitute transparent rules for
“weight of evidence” judgments. The EPA should also require regulators
to elaborate in detail whenever they apply a “weight of evidence”
judgment, by means of a coherent argument which can be falsified by
independent critique.


9. The EPA should report the proportion of positive results to
negative results in the research it funds.


The EPA’s bureaucratic self-interest, and its mandate, will always
incline its employees, consciously or unconsciously, to fund research
that supports regulation. The EPA must make a conscious effort to
ensure that the research it funds does not put a thumb on the scales of
the field’s research as a whole: that it does not fund an overabundance of
false positive results and then say that the “weight of evidence” justifies
regulation.

The EPA should report the proportion of positive to negative results
in the research it funds, with data reported for every program and
discipline. Here we propose that any program or discipline that reports more
than 65% positive results in the research it funds should initiate a reform
of its granting program, to counter the effects of bureaucratic
self-interest and groupthink.
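The proposed reporting rule amounts to a simple calculation, sketched below. The function names and the program names in the example are hypothetical, and we assume here that “positive results” is measured as a share of all reported results (positive plus negative); only the 65% threshold comes from the recommendation above.

```python
from typing import Dict, List, Tuple


def positive_share(positive: int, negative: int) -> float:
    """Share of positive results among all reported (positive + negative) results."""
    total = positive + negative
    if total == 0:
        raise ValueError("no results reported")
    return positive / total


def programs_needing_reform(
    counts: Dict[str, Tuple[int, int]], threshold: float = 0.65
) -> List[str]:
    """Names of programs whose positive share exceeds the threshold."""
    return sorted(
        name
        for name, (positive, negative) in counts.items()
        if positive_share(positive, negative) > threshold
    )


# Hypothetical counts of (positive, negative) results per funded program.
funded = {"Program A": (70, 30), "Program B": (50, 50)}
print(programs_needing_reform(funded))  # only Program A exceeds 65%
```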


10. The EPA should not rely on research claims of other
organizations until these organizations adopt sound statistical
practices.


The EPA often funds external organizations, such as the World Health
Organization (WHO), the International Agency for Research on Cancer
(IARC), and the Health Effects Institute (HEI). These organizations are
effectively beyond the reach of oversight.


11. The EPA should increase funding to investigate direct
causal biological links between substances and health
outcomes.


Environmental epidemiology depends on establishing statistical asso- 
ciations in default of establishing direct causal biological links between 
substances and health outcomes. The EPA’s reliance on association 
rather than causation weakens the justifications of its regulations. The 
EPA should redirect grant funding toward investigating direct causal 
biological links between substances and health outcomes, so as to
minimize its reliance on statistical associations. A note of caution,
however, is that direct experimentation on humans has been conducted and
indications of dire effects have not been found.¹⁷⁷


177 Milloy (2016). 

As a corollary, the EPA also should place substantially greater weight
on negative results in research to establish direct causal biological links.
It should also establish a set procedure by which a sufficient number
of such negative results preclude regulation absent research that proves
statistical association to a substantially higher standard of rigor than at
present required.

All federal regulatory agencies, wherever relevant, should undertake
parallel reforms.


Scope and Implementation 


We believe the EPA should not overturn previous regulations
arbitrarily as it implements our recommendations. Regulatory stability is an
important goal for the federal government, and indeed for any system
of laws and regulations. American enterprises have invested substantial
resources in regulatory compliance, and their investments should not
casually be set at naught.

Extensive regulatory schemes can amount to a competitive advantage
for large corporations against small ones, since large companies have the
capacity to comply with an extensive regulatory framework. Regulatory
stability should not be used to provide an enduring competitive
advantage to big business. Furthermore, regulatory costs are borne ultimately
by American enterprises’ consumers, who are American citizens. These costs
can have negative health consequences, such as those that follow from
increased unemployment.

When new data, new analysis, and new theory call into question and
overturn previously established science, the regulations that
now-discredited science once justified should be dismantled, if not in
haste, then with all deliberate speed. These reforms can and should be
introduced via the EPA’s regular, planned regulatory reviews, which will
allow the reform of procedures to enact new regulations to proceed in an
orderly manner. But these regulatory reviews should not exempt existing
regulations. We should not grandfather bad science forever, or even for very
long.

For a highly relevant example, consider the Harvard Six Cities/ACS
studies that are cited in support of current PM2.5 regulation. Enstrom
asserts that Pope’s 1995 ACS II paper only achieved statistical
significance by “data gardening,” since not all the data that were available
were used in the analysis. Other researchers have also subjected Dockery
(1993) to severe criticism. Since the Dockery and Pope studies are not
reproducible, and recent negative studies are, how should the EPA
unwind its regulations?

We suggest a multi-part reform. The government should announce
that it will cease using the Harvard Six Cities/ACS studies, and similarly
irreproducible data sources, by some reasonably near date, unless the
underlying data have been made publicly available. At the same time, the
government should immediately begin to fund a high-priority program
to create a new, substitute data set, with born-open, publicly accessible
data and built-in de-identification to address any privacy concerns.

These data will then be available for the EPA to use once it ceases 
using the Harvard Six Cities/ACS studies and similarly irreproducible 
data sources. If the new data do not justify the regulations, then the 
regulations can be withdrawn in an orderly manner. If the new data do 
justify the regulations, then the regulations can be continued. This multi- 
part reform should maximize reproducibility reforms and regulatory 
stability. 

Similarly crafted multi-part reforms, enacted throughout the EPA’s
remit, ought to maximize the twin goods of good science and stable
regulation.

Final Considerations


We have used the phrase “irreproducibility crisis” throughout this
report, and we should note that distinguished meta-researchers prefer
to regard the current state of affairs as an “irreproducibility challenge.”
This is a serious caution. While we do think that questionable research
procedures, p-hacking and HARKing are endemic within science, and
particularly within environmental epidemiology, we also recognize that
not every reader will accept that such a crisis exists.

For such readers, we say that you do not need to believe that there is an 
irreproducibility crisis. You can believe that it is better to regard these 
problems as irreproducibility challenges. Whether challenge or crisis, 
these scientific practices are not the best available science. We should use 
the best scientific practices simply because they are the best scientific 
practices. Mediocrity ought not be acceptable. 

This applies doubly to the science that underpins government regu- 
lation. Statistical research that seeks out associations must justify itself 
against the null hypothesis. Likewise, regulations that seek to restrict 
freedom must justify themselves against the null hypothesis of a free 
republic—that it is better for government to do nothing and for the repub- 
lic’s citizens to exercise their freedoms untrammeled. Research used 
to justify government regulations, even more than ordinary research, 
should survive every rigorous test available before it is taken as credible. 

This has long been the spirit of American regulatory policy. Our 
policymakers, representing the American people, long ago decided that 
regulations must justify themselves with the best available science—that 
is, science that has passed the severest tests. They used this phrase to 
defend liberty, not to facilitate its abrogation; to restrict regulation to the 
least necessary and not to facilitate the expansion of government
regulation. Best available science was meant to restrict government
bureaucrats, not to authorize them to build regulatory empires.

The reforms we suggest respond partly to the development of a
regulatory research regime entirely too fixed on extending regulations,
regardless of the underlying science. These reforms also respond to the 
developing professional and public awareness of the irreproducibility 
crisis, and its ramifications. Finally, they build on new statistical strate- 
gies that have been devised to ensure that we are using the best available 
science. Even were there no scientific-regulatory complex, even were 
there no irreproducibility crisis, we would champion government adop- 
tion of these new methods to assess research, simply because they are the 
best methods. The American government should not be constrained by 
obsolescent methods or secret data as it seeks to judge the best science. 

We have subjected the science underpinning PM2.5 regulation to a serious
critique, and we believe the EPA should take account of this critique
as it reforms these particular regulations. But we care even more about 
reforming the procedures the EPA uses in general to assess science—and 
the procedures used throughout government. 

Government regulatory procedure matters far more than any 
particular implementation of regulatory policy. Validation procedures 
for statistical data matter the most of all, regardless of how they affect 
government policy, for science cannot reliably seek out truth on a
foundation of rotten procedure. This report focuses on government
regulatory policy, but we must never lose sight of that loftier goal.

The government should use the very best science—whatever the 
regulatory consequences. Scientists should use the very best research 
procedures—whatever the result they find. Those principles are the twin 
keynotes of this report. The very best science and research procedures
involve building evidence on the solid rock of transparent, reproducible,
and reproduced scientific inquiry, not on shifting sands.

Appendix 1: Multiple Testing and Multiple Modeling (MTMM) and Epidemiology


Multiple Testing and Multiple Modeling (MTMM) controls for
experiment-wise error: the probability that at least one individual
claim will register a false positive when you conduct
multiple statistical tests. It is instructive to trace some of the history
of MTMM in epidemiology, with examples.
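To give a sense of how quickly experiment-wise error grows with the number of tests, the sketch below computes the probability of at least one false positive. It rests on a simplifying assumption that is ours, not a feature of any particular study: that all tests are independent and that every null hypothesis is true. The function name `family_wise_error_rate` is likewise only illustrative.

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across n independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Illustrative test counts; real tests are rarely fully independent.
for m in (1, 10, 100):
    print(m, round(family_wise_error_rate(m), 3))
```

Under these assumptions a single test carries a 5% chance of a false positive, while a hundred tests make at least one false positive all but certain.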

Friedman made a research claim in 1959 that Type A personality was
associated with heart attacks. Several later studies failed to replicate
these results. Expert committees found fault with these later studies,
and the Type A personality-heart attack claim lives to this day. Yet
Friedman’s initial study examined hundreds of distinct analytical
questions. It is very likely that the association is nothing more than a
multiple-testing false positive.

In 1974, a Lancet paper noted an association between the popular
blood-pressure drug reserpine and breast cancer, with a p-value < 0.01.
Several later studies failed to replicate these results. Sam Shapiro, a
co-author of the original Lancet paper, later explained that,


Slone and I came to realize that our initial hypothesis-generating
study was sloppily designed and inadequately performed. In
addition, we had carried out, quite literally, thousands of comparisons
involving hundreds of outcomes and hundreds (if not thousands) of
exposures. As a matter of probability theory, ‘statistically
significant’ associations were bound to pop up and what we had described
as a possibly causal association was really a chance finding.


Yale epidemiologist Alvan Feinstein provided the first rigorous insight
into epidemiology’s multiple testing (MTMM) problem in two 1988 papers.
Feinstein’s first paper counted published studies for and against 56
different research claims and found that there were roughly as many
studies supporting each particular claim as there were studies
rejecting the claim.

Feinstein’s second paper argued that a close analysis of these studies 
revealed that the researchers did not begin their research with a defined, 
single question. Instead, they allowed the data to define the question and 
then published the results. An enormous proportion of epidemiology
research conclusions were the result of multiple testing and (in modern
nomenclature) HARKing, hypothesizing after the results are known.

Statisticians have long been aware of the pitfalls of multiple testing:
practitioners are keenly aware that error probabilities are not
maintained when there is multiple testing of the same set of data. In the 1970s
and 1980s, statisticians produced considerable literature on applied
medical work that examined associations of blood types with disease.

In 1985, Westfall observed that the relevant research produced
multiple confidence intervals, and that these intervals could be made just
wide enough to provide a proper correction parameter for the body of
multiple tests by the use of resampling techniques that preserved the
overall family-wise error rate, i.e., the chance of producing at least one
false positive result while making multiple statistical tests. In other
words, researchers who used resampling techniques now had a practical way
to assess the probability that multiple testing had produced false
positive results. Simulation could solve the otherwise intractable multiple
testing problem.
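The general idea of a resampling-based multiplicity adjustment can be illustrated with a small permutation sketch. We stress that this is not a reproduction of Westfall’s procedure: it is a simplified single-step “max statistic” adjustment, using an unstandardized difference in group means as the test statistic, and the function names, the data, and the number of permutations are all hypothetical choices made for illustration.

```python
import random
from typing import List, Sequence


def mean_diff(values: Sequence[float], labels: Sequence[int]) -> float:
    """Absolute difference in means between the label-1 and label-0 groups."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def max_t_adjusted_p(
    outcomes: List[List[float]], labels: List[int], n_perm: int = 2000, seed: int = 0
) -> List[float]:
    """Single-step max-statistic adjusted p-values via label permutation.

    Each outcome's observed statistic is compared with the permutation
    distribution of the LARGEST statistic across all outcomes, which
    accounts for the fact that many outcomes were tested at once.
    """
    rng = random.Random(seed)
    observed = [mean_diff(column, labels) for column in outcomes]
    exceed = [0] * len(outcomes)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between labels and outcomes
        max_stat = max(mean_diff(column, shuffled) for column in outcomes)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add-one correction keeps every adjusted p-value strictly above zero.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Made-up data: one exaggerated real effect plus five pure-noise outcomes.
labels = [1] * 10 + [0] * 10
noise_rng = random.Random(1)
real_effect = [10.0] * 10 + [0.0] * 10
noise = [[noise_rng.gauss(0, 1) for _ in labels] for _ in range(5)]
adjusted = max_t_adjusted_p([real_effect] + noise, labels, n_perm=500)
print([round(p, 3) for p in adjusted])
```

Because each outcome is judged against the largest statistic that chance alone produces across all outcomes, a noise outcome that looks nominally significant on its own will usually lose that significance after adjustment, while a strong real effect survives.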

Epidemiologists, unfortunately, instead decided as a body to
disregard the multiple-testing challenge identified by Feinstein. In 1990,
the lead editorial in the very first issue of the new journal Epidemiology
explicitly articulated this disregard in its title: “No Adjustments Are Needed
for Multiple Comparisons.” The discipline, alas, generally has followed
this counsel.
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A book offering practical solutions to the multiple testing problem has 
been available since 1993”" and it has been cited more than 3500 times 
since;?” but very rarely is it used or cited in the major epidemiology jour- 
nals.’ In 2000, Clyde did recognize that environmental epidemiology 
needed to account for multiple modeling and proposed a Bayesian model 
average as a solution.’™ The field also has paid limited attention to this 
alternate solution. Clyde (2000) has only been cited twice in the leading 
environmental epidemiology journal Environmental Health Perspectives.” 

Hayat et al. recently analyzed 216 randomly selected articles from a
total of 1,023 published in 2013 in seven influential public health journals
(American Journal of Public Health, American Journal of Preventive Medicine,
International Journal of Epidemiology, European Journal of Epidemiology,
Epidemiology, American Journal of Epidemiology, and Bulletin of the World
Health Organization). Only 5.1% of the 216 studies they reviewed reported
making statistical corrections for multiple testing.²⁰⁶ We speculate that
the studies that performed these corrections were in the genetic
epidemiology subdiscipline. As a whole, epidemiologists have not subjected
their research to the severe test of Multiple Testing and Multiple Modeling.
Their unwillingness to subject their research to this easy and basic test
warrants significant skepticism of all the field’s results.

tives, once in American Journal of Epidemiology, once in International Journal of Epidemiology, and never in
Annals of Epidemiology or Epidemiology.

204 Clyde (2000). 

205 GS (2020b). The two citing articles are Moolgavkar (2013); Roberts (2010). 

206 Hayat (2017).

Appendix 2: Constructing P-value Plots


Researchers must be careful when they construct p-value plots. The
decision about which p-values to plot must itself be made regularly and
transparently, and the procedures disclosed in advance, since the choice
of which p-values to assemble will itself influence the result.

That noted, it is not easy to manufacture a p-value plot that exhibits
only randomness. Professionals and the public should take any such result
as very strong evidence indeed that the body of literature really
demonstrates no significant association: that nothing interesting, nothing
important, has been discovered.


Behavior of P-value Plots for Simulated Data 


We illustrate in Figure A2.1 and Figure A2.2 below the expected behavior
of p-values representing true null hypothesis (no association) outcomes for a
simulated data set. Over the years, researchers have developed many tools to 
help people visualize the results from a series of experiments. The most basic of 
these is a simple scatter diagram, Figure A2.1. The scatter diagram represents 
the results from a simulated series of 100 experiments designed to confirm or 
reject some unspecified null hypothesis. The black dots are p-values, presented 
in chronological order from left to right. These p-values were simulated using a 
pseudo-random number generator; a uniform distribution on the interval [0, 1] 
was specified. The results appear to be quite random, resembling the pattern of 
holes from a shotgun blast. 
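A minimal sketch of how such a series might be simulated is given below. The seed and the function name are assumptions for illustration; the report does not state which generator or seed produced Figure A2.1.

```python
import random


def simulate_null_p_values(n=100, seed=42):
    """Draw n p-values from Uniform[0, 1], as expected under a true null."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# The list order stands in for chronological order of the experiments.
p_values = simulate_null_p_values()
below_005 = sum(p < 0.05 for p in p_values)
print(f"{below_005} of {len(p_values)} simulated p-values fall below 0.05")
```

Because the values are uniform, about 5 in 100 are expected to fall below 0.05 by chance alone, although any single run will vary.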


Now, instead of presenting these data in chronological order, we can sort them
in ascending sequence, with the smallest p-values on the left, and the largest on
the right. This is shown in Figure A2.2. Simply sorting the data brings order out 
of chaos. The random scattering now appears as a quasi-linear curve, meandering 
from the point (0, 0) (at the lower left) to the point (100, 1) (upper right). If we 
were to increase the number of simulated experiments and sort this larger number 
of uniformly distributed p-values, the resulting graph would look more and more 
like a straight line. And if the graph is scaled to be perfectly square, the “line” 
would be inclined at an angle of 45° to the x-axis, because it must run from the 
lower left corner to the upper right corner of a square. 
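The sorting step, and the way the sorted curve tends to straighten as the number of simulated experiments grows, can be sketched as follows. The seed, the sample sizes, and the use of the largest vertical gap from the diagonal as a summary are assumptions chosen for illustration.

```python
import random


def max_gap_from_diagonal(n, seed=0):
    """Largest vertical gap between sorted uniform p-values and the 45-degree line."""
    rng = random.Random(seed)
    ordered = sorted(rng.random() for _ in range(n))
    # The p-value of rank i (1-based) is compared with the diagonal height i / n.
    return max(abs(p - (i + 1) / n) for i, p in enumerate(ordered))


for n in (100, 1000, 10000):
    print(n, round(max_gap_from_diagonal(n), 3))
```

With more simulated experiments the largest gap typically shrinks, which is the sense in which the sorted plot looks more and more like a straight line.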


So here we have a simple procedure for determining if a series of experiments 
actually confirms the null hypothesis (instead of rejecting it). An upward sloping 
quasi-linear curve, appearing when p-values are plotted in ascending order, is a 
sort of fingerprint. Whenever it appears, we can be certain that the null hypothesis 
has not been rejected.

Figure A2.1: Scatter diagram of a simulated series of 100 experiments
designed to confirm or reject some unspecified null hypothesis

Appendix 3: Common Examples of Bilinear P-value Plot Behavior


Base studies ultimately selected for meta-analysis tend to be
carefully screened and evaluated so that they are consistent in
addressing a single research question of interest and that they
meet other rigorous eligibility criteria regarding methods and datasets.
Studies selected for meta-analyses are (supposed to be) directly
comparable, basically homogeneous, so they can be aggregated. A p-value plot
of meta-analysis data contains p-values drawn from carefully selected
base studies all examining a single research claim, e.g., whether risk
factor A causes disease B.
A p-value plot exhibiting bilinearity (two lines with very different
slopes) ought not to exist in valid scientific literature. Such a p-value
plot might be interpreted to mean that:


for p-values < 0.05, the research claim is true (i.e., a true
(positive) effect exists between risk factor A and disease B);
or

for p-values > 0.05, the research claim is not true (i.e., a null
(negative) effect exists between risk factor A and disease B).


Logically, both outcomes cannot be true. Widespread existence
of p-hacking (which can be identified by counting) and publication
bias support an interpretation that the research claim is not true.²⁰⁷
Confirmation bias is part of the process. Scientists making a research
claim are required to back their claim up with strong evidence. They
need to be able to explain away null findings (those carefully selected
base studies with p-values > 0.05). In science, one cannot “prove a
negative,” but a number of p-values on a 45-degree line is strong evidence
that the negative effect is true.

Having stated this, bilinearity does not provide incontrovertible 
evidence of publication bias, p-hacking, or HARKing. Occasionally real 
effects will register as bilinear p-value plots—but this occurs rather 
rarely and can indicate an undetected and/or imperfectly controlled 
variable. Professionals and the public should take a bilinear p-value plot
as a compelling reason to re-examine a field of literature for distorting
biases. A given field’s conclusions should be regarded skeptically until the
questions raised by a bilinear p-value plot have been resolved.

207 Bruns (2016); Head (2015); but see Hartgerink (2017); Tanner (2015).

Three common examples of p-value plots of meta-analyses drawn 
from published literature that exhibit bilinear characteristics are 
provided. We should mention that our experience so far has been that 
most meta-analyses subjected to p-value plotting exhibit bilinearity. 
We strongly suspect that a disturbingly large portion of claims made in 
meta-analyses lack statistical support—although we cannot yet quantify 
that judgment. 

Figure A3.1 plots the p-values of a meta-analysis of 23 RCTs that
examined the mean difference in reduction of chronic low back pain at
one month for spinal manipulative therapy (SMT) versus recommended
therapies in adults older than 18 with chronic low back pain.²⁰⁸ The
meta-analysis claimed that “SMT produces similar effects to recommended
therapies for chronic low back pain...”


Figure A3.1: P-value Plot, Meta-analysis of 23 RCT data sets, Mean
difference in reduction of chronic low back pain at 1 month for spinal
manipulative therapy (SMT) versus recommended therapies

[Plot: p-value (vertical axis) against rank order (horizontal axis, 0 to 24)]


208 Rubinstein (2019). The 23 randomized controlled trials are listed in Figure 2 in Rubinstein (2019).

Figure A3.1 clearly depicts bilinear plot behavior. One cluster of small
p-values follows a near-horizontal line, but most p-values, including
several that are less than 0.05, approximately follow a 45-degree line. The
pattern strongly suggests that publication bias, p-hacking, or HARKing
has altered a field whose results supported the presumption of
randomness: that there is no consistent overall positive association of
spinal manipulative therapy and reduction in chronic low back pain compared
to recommended therapies.
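One rough way to summarize a rank-ordered p-value plot numerically is sketched below. This is an illustrative heuristic, not the procedure used in this report: it sorts the p-values, counts those below 0.05, and fits a separate least-squares slope to each group. The p-values in the example are made up for illustration and do not come from any cited study; the function names and the 0.05 cutoff are likewise assumptions.

```python
def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def summarize_p_value_plot(p_values, cutoff=0.05):
    """Rank-order the p-values and fit one slope on each side of the cutoff."""
    ordered = sorted(p_values)
    points = list(zip(range(1, len(ordered) + 1), ordered))
    small = [(r, p) for r, p in points if p < cutoff]
    large = [(r, p) for r, p in points if p >= cutoff]

    def slope(group):
        if len(group) < 2:
            return None  # too few points to fit a line
        return least_squares_slope([r for r, _ in group], [p for _, p in group])

    return {
        "n_small": len(small),
        "n_large": len(large),
        "slope_small": slope(small),
        "slope_large": slope(large),
    }


# Made-up p-values for illustration only.
example = [0.0001, 0.001, 0.004, 0.01, 0.03, 0.08,
           0.21, 0.35, 0.48, 0.62, 0.77, 0.91]
summary = summarize_p_value_plot(example)
print(summary)
```

A near-zero slope for the small p-values next to a much steeper slope for the rest is the two-line pattern the text calls bilinear. A real analysis would still need to inspect the plot itself.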

We note that in this instance the meta-analysis did not result in a
false positive claim being made: Rubinstein et al. concluded that “SMT
produces similar effects to recommended therapies for chronic low back
pain.”²⁰⁹ The next two meta-analysis examples show similar bilinear
p-value plot characteristics, but with likely false positive claims
being made.

Figure A3.2 plots the p-values of a meta-analysis of 19 randomized
clinical trials that examined the association between anxiety symptoms
and omega-3 polyunsaturated fatty acids treatment compared
with controls in varied populations.²¹⁰ This meta-analysis claimed that
“omega-3 PUFAs might help to reduce the symptoms of clinical anxiety.”


209 Rubinstein (2019).
210 Su (2018). The 19 clinical trials are listed in Su (2018) as References 33-36, 47-61.

Figure A3.2: P-value Plot, Meta-analysis of 19 RCT data sets,
Association of treatment with reduced anxiety symptoms in patients receiving
and not receiving omega-3 polyunsaturated fatty acids

[Plot: p-value against rank order]

Figure A3.2 is similar to Figure A3.1, with p-values again taking a
bilinear shape: a cluster of small p-values along the horizon, and most
p-values approximately following a near 45-degree line. The pattern
strongly suggests that publication bias, p-hacking, or HARKing has
altered a field whose results supported the presumption of randomness:
that there is no consistent overall positive association across the studies
used in the meta-analysis.

In this instance, the meta-analysis registered a statistically signifi- 
cant association: “This review indicates that omega-3 PUFAs might help to 
reduce the symptoms of clinical anxiety.” The research field’s result had 
shifted not only in the direction of statistical significance, but across the 
line. 

Figure A3.3 plots the p-values of a meta-analysis of 17 field
observational studies examining the association between inferred exposure
to PM2.5 in ambient air and lung cancer incidence and mortality: 16 cohort
studies and one case-control study.²¹¹ This final meta-analysis made the
following claim: “Our findings suggest that long-term exposure to PM2.5 is
significantly associated both with LC [lung cancer] incidence and mortality.”


Figure A3.3: P-value Plot, Meta-analysis of 17 field observational
data sets, Association between inferred exposure to PM2.5 in ambient
air and lung cancer incidence and mortality

[Plot: p-value (vertical axis) against rank order (horizontal axis, 0 to 18)]


Figure A3.3 depicts bilinear plot behavior, with a cluster of small
p-values on a nearly horizontal line and most p-values approximately
following a 45-degree line. Although the data in Figure A3.3 are less
clear than in Figure A3.1 and Figure A3.2, the distribution of the
p-values from independent studies representing PM2.5 in ambient air and lung
cancer incidence and mortality clearly deviates from true nulls (Figure
4) or significant association (Figure 5). The curve forms an intermediate
bow-shape rather than a 45-degree line (strong evidence of no
significant association) or a horizontal line (strong evidence of a
significant association).

The general pattern suggests that publication bias, p-hacking, or
HARKing may have altered a field whose results, properly corrected,
would indicate that there is no consistent overall positive association
across the studies used in the meta-analysis. The evidence is mixed, but
still strong enough to suggest that the claim of a significant association
requires further investigation before it is accepted by researchers in the
field.

211 Huang (2017). The 17 field observational studies are listed in Huang (2017) as References 7-23.

Appendix 4: PM2.5—Mortality Causality—Incomplete Evidence


Throughout this report we address the usual (typical) analysis
methods used by researchers in the field of environmental epidemiology,
accepting the data as given. We point out that multiple
testing and multiple modeling (MTMM) problems are ignored and argue
that any environmental epidemiology claims made without addressing
MTMM problems lack sufficient statistical support.

We have not addressed other fundamental criticisms about the way
environmental epidemiology researchers consider causality.²¹² For
example, Cox recently provided an up-to-date summary of criticisms that
continue to plague PM2.5-mortality causality studies in environmental
epidemiology, including:


Omitted predictors and confounders.

Uncontrolled residual confounding.

Unmodeled interactions among variables.

Untested and incorrect modeling assumptions.

Unmodeled exposure uncertainties.

Unjustified interventional causal interpretation of regression
coefficients.²¹³


If you change a system, it should change in a predictable way. 
Classically, causality is established with experiments. Many factors are 
directly controlled, other factors are statistically controlled, and exper- 
imental units are assigned treatments at random. The entire process is 
preplanned and specified. After the experiment is run, the response 
of the system is examined. Absent full-scale experiments, so-called
quasi-experiments can provide information on causality.

Here we provide three examples of environmental epidemiology
quasi-experiments that invalidate PM2.5-mortality causal associations.²¹⁴

In 1970 the Clean Air Act Amendments designated U.S. counties with
annual average total suspended particulates (TSP) greater than a
threshold as nonattainment locations. These nonattainment locations faced
stricter regulations starting in 1972 than those in attainment locations. A
natural experiment came into being with nonattainment and attainment
counties forming two different groups.²¹⁵ The full data set covered 560
U.S. counties for six consecutive years (1969-1974) and was subjected to
analysis comparing death rates in nonattainment to attainment counties.
Air quality improved over the six years in nonattainment counties, but
death rates did not. Using the same data, but a different analysis strategy,
Obenchain came to the same conclusion and also made the data set
public.²¹⁶

212 Briggs (2016); Briggs (2017); Cox (2017); Cox (2020).
213 Cox (2020); also see CASAC (2019).
214 Chay (2003); also see reanalysis by Obenchain (2017); Zu (2016); data from Young (2017a).

In another noteworthy study, forest fires in Quebec in the summer of
2002 resulted in forest fire smoke migrating down the U.S. east coast.²¹⁷
PM2.5 measurement data and mortality data were obtained for a 4-week
period in July 2002 for Greater Boston (over 1.7 million people) and New
York City (over 8 million people). Daily average PM2.5 concentrations were
noticeably increased in both cities for 3 days during this period, reaching
as high as 63 µg/m³ in Boston and 86 µg/m³ in New York City versus 4-48
µg/m³ on non-smoke days. Temporal patterns of natural-cause deaths
and daily average PM2.5 concentrations did not indicate any discernible
increase in daily mortality in either city for high- versus non-smoke days.

Finally, Young studied and made public a data set containing daily
deaths, daily air quality levels (PM2.5 and ozone), daily temperature
levels (minimum and maximum), and daily maximum relative humidity
levels for the eight most populous California air basins.²¹⁸ The data set
encompassed thirteen years, more than 2 million deaths and over 37,000
exposure days. The data set was analyzed using time series analysis. A
sensitivity analysis was computed varying model parameters, locations
and years, which included over 70,000 variations of analysis. The study
found little evidence for association between air quality and deaths.
Within the data set, there were several smoke/PM2.5 events, and they did
not exhibit correlations with daily deaths.²¹⁹


215 Chay (2003).
216 Obenchain (2017).
217 Zu (2016).
218 Young (2017a).
219 Young (2014).

Our MTMM critique does not depend on these questions of causality.
Nevertheless, we also recommend that environmental epidemiology
researchers and EPA regulators take account of this separate, serious
causality critique.

Appendix 5
All-cause mortality


We investigated the reliability of claims from studies used
in meta-analysis associating PM2.5 and other air quality
components with all-cause mortality. We analyzed 29 risk
ratios and confidence limits taken from 27 researchers for all-cause
mortality and PM2.5; two researchers had two papers. While other
components were available for analysis, here we analyzed only results
pertaining to PM2.5 (see Figure A5.1).


Figure A5.1: Natural log of the Environmental Effect [Ln(EE)] and its
Standard Error [SE Ln(EE)] for 29 Risk Ratios and Confidence Limits of
PM2.5-All-Cause Mortality Associations²²⁰


Rank  ID           Author             Year         Ln(EE)    SE Ln(EE)  Z value      p-value
1     337          Dai                2014         0.011731  0.001309   8.959154     3.27E-19
2     273          Chen               2011         0.004589  0.000558   8.219683     2.04E-16
3     758          Lee                2016         0.01548   0.001905   8.123914     4.51E-16
4     772          Li                 2017         0.001699  0.000306   [illegible]  2.7E-08
5     3633         Janssen            2013         0.007968  0.002021   3.943442     8.03E-05
6     1774         Li                 2018         0.002497  0.000661   3.776385     0.000159
7     1409         Tsai               2014         0.039268  0.010868   3.613243     0.000302
8     5            Madsen             2012         0.027615  0.00788    3.504571     0.000457
9     1733         Wu                 2018         0.005485  0.001571   3.492337     0.000479
10    1714         Reyna              2012         0.008243  0.002527   3.261944     0.001107
11    489          Garret             2011         0.006678  0.002477   2.695497     0.007028
12    [illegible]  Hong               2017         0.112056  0.049633   2.257704     0.023964
13    245          Castillejos        2000         0.014692  0.007387   1.988796     0.046724
14    [illegible]  Burnett            2004         0.005993  0.003191   1.877872     0.060399
15    1691         Dockery            1992         0.017059  0.009617   1.773869     0.076085
16    84           Atkinson           2016         -0.01419  0.008234   -1.72311     0.084868
17    1691         Dockery            1992         0.022793  0.018614   1.224494     0.220766
18    975          Neuberger          2007         0.004988  0.005052   0.987326     0.323483
19    1709         Simpson            2000         0.007997  0.00815    0.981175     0.326506
20    974          Neuberger          2013         0.003992  0.004553   0.876757     0.380618
21    1695         Peters             2009         0.004888  0.006454   0.757427     0.448794
22    1411         Tsai               2014         0.007692  0.012302   0.625253     0.531805
23    748          Lanzinger          2016         -0.00404  0.006971   -0.57993     0.561965
24    [illegible]  Anderson           2001         0.00338   0.005955   0.567519     0.570362
25    827          Lopez-Villarrubia  [illegible]  -0.00914  0.016166   -0.56548     0.571746
26    827          Lopez-Villarrubia  [illegible]  -0.00682  0.016925   -0.40314     0.686842
27    186          Branis             2010         -0.002    0.006098   -0.3283      0.742686
28    [illegible]  Basagana           2015         0.001112  0.005225   0.212729     0.831538
29    722          Kollanus           2016         0.001998  0.041073   0.048645     0.961202


220 Orellano (2020). We have extracted these data from Orellano (2020), Appendix A, Figure A.5.

The claimed effect possessed a risk ratio of 1.0065, with 95%
confidence limits of 1.0044 to 1.0086; that is, elevated PM2.5
concentrations imposed a higher risk of all-cause mortality. A risk ratio of 1.000
is considered to register no effect; a risk ratio of 2.000 would provide
evidence that elevated PM2.5 concentrations imposed a higher risk of
all-cause mortality; risk ratios between 1.000 and 2.000 indicate
increasing evidence that elevated PM2.5 concentrations impose a higher risk of
all-cause mortality, from no effect to a strong effect.
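To relate the pooled risk ratio to the log scale used in Figure A5.1, the sketch below converts the risk ratio and its 95% limits into a log effect, an approximate standard error, and a Z value. It assumes the limits are symmetric on the log scale and were built with the usual 1.96 multiplier; neither assumption is stated in the source, and the function name is illustrative.

```python
import math


def log_rr_summary(rr, lower, upper, z_crit=1.96):
    """Log risk ratio, approximate SE from the CI width, and Z value."""
    ln_rr = math.log(rr)
    # Assumed: CI = exp(ln_rr +/- z_crit * se), so the log width is 2 * z_crit * se.
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    return ln_rr, se, ln_rr / se


ln_rr, se, z = log_rr_summary(1.0065, 1.0044, 1.0086)
print(f"ln(RR) = {ln_rr:.6f}, SE = {se:.6f}, Z = {z:.2f}")
```

The point of the exercise is scale: the log effect is about 0.0065 with a standard error near 0.001, so the claimed effect sits very close to the no-effect value of zero in absolute terms.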

So far as we know, a claim of 1.0065 is the smallest risk ratio claim
for PM2.5 that has been declared “real” in the literature. Indeed, to our
knowledge, this is the smallest risk ratio claim which has ever been
considered in any scientific literature. So small a risk ratio indicates that
even a small distortion in the base papers (whether from publication
bias, HARKing, insufficient correction for MTMM, or other questionable
research procedures) might have produced this “real” effect as a
statistical artifact.


Heart attacks


We investigated the reliability of claims from studies used in
meta-analysis of short-term air quality-heart attack associations. The
numbers of outcomes, predictors, covariates and time lags available in
each of the 34 studies were counted to estimate the number of
statistical tests performed (Figure A5.2).


Figure A5.2: Outcomes, Predictors, and Lags in 34 studies, associations
between air quality components and heart attacks²²¹


Citation  Author      Outcomes  Predictors  Lags  Covariates  Questions  Models  Search space
36        D'Ippoliti  3         4           3     11          36         2,048   73,728
37        Henrotin    4         5           14    14          280        16,384  4,587,520
38        Ueda        3         1           3     7           9          128     1,152
39        Mann        4         4           7     9           112        512     57,344
40        Sharovsky   4         3           8     10          96         1,024   98,304
41        Belleudi    4         3           13    8           156        256     39,936
42        Nuvolone    1         3           8     9           24         512     12,288
43        Peters      4         5           4     10          80         1,024   81,920
44        Ruidavets   4         3           4     8           48         256     12,288
45        Zanobetti   2         6           3     7           36         128     4,608
46        Bhaskaran   1         5           5     7           25         128     3,200


221 Young (2019b). The citation numbers in the first column reproduce the citation numbers in Mustafic (2012).
The 34 studies are listed in Mustafic (2012) as References 7-12, 19-46


We note that because these studies introduce statistical tests on lags, the number of Questions
is no longer Outcomes x Predictors, but Outcomes x Predictors x Lags. Models = 2^k, where k =
number of Covariates. Search space = approximation of analysis search space = Questions x
Models.²²²
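The counting rule in the note above can be written out directly. The sketch below applies it to the Henrotin row of Figure A5.2 (4 outcomes, 5 predictors, 14 lags, 14 covariates); the function name is an assumption for illustration.

```python
def search_space(outcomes, predictors, lags, covariates):
    """Return (questions, models, search space) under the note's counting rule."""
    questions = outcomes * predictors * lags
    models = 2 ** covariates  # each covariate is either in or out of a model
    return questions, models, questions * models


print(search_space(4, 5, 14, 14))  # → (280, 16384, 4587520)
```

The exponential term dominates: each added covariate doubles the number of candidate models, and with it the search space.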


Asthma


We investigated the reliability of claims from studies used in two
meta-analyses of air quality-asthma attack associations: the first
meta-analysis (Anderson) analyzed long-term cohort studies related to
development of asthma and the second meta-analysis (Zheng) analyzed
short-term time-series studies related to asthma attack.²²³

Causes and Risk factors


Asthma is associated with three principal characteristics: (1) variable
airways obstruction, (2) airway hyperresponsiveness and spasm
with wheezing and coughing, and (3) airway inflammation. Asthma may
be characterized clinically by episodic, reversible obstruction of
airways that variably presents as symptoms ranging from cough to
wheezing, shortness of breath, or chest tightness.

The diagnosis of asthma/reactive airways is challenging. Asthma
that starts young does not start in babies; it develops when victims are
toddlers. Asthma that develops prior to the teen years may develop in
hyperresponders (possessing an abnormally high degree of
responsiveness) who would have benefited from desensitization due to lack of
immune challenges in postneonatal immune system development. If
adults who have smoked for a time develop bronchitis, they will
demonstrate reactive airways that are triggered by various mechanisms (e.g.,
fumes, cold air, exercise) and so they may display the symptoms of late
onset asthmatics.

222 “Search space” might also be called “sample space.” We have already established the term “search space”
in the professional literature (e.g., Young (2019b)), and we will continue to use it here.
223 Anderson (2013); Zheng (2015).

There is even continuing debate as to whether asthma is one dis- 
ease or several different diseases that include airway inflammation; 
however, two thirds (or more) of asthmatic patients have an allergic 
component to their disease and are thought to have allergic asthma. Not 
enough is currently known to rule out allergic causes in a vast majority 
of asthmatic problems. 

As for development of asthma, the disease frequently first expresses
itself early in the first few years of life, arising from a combination of
genetic and non-genetic factors. Most investigators would agree there is
a major hereditary contribution to the underlying causes of asthma and
allergic diseases.²²⁴


Prevalence 


We looked at asthma prevalence in the American (US) population in 
relation to ambient air quality. We accessed asthma prevalence data 
from the US Centers for Disease Control and Prevention (Atlanta, GA). 
The prevalence data we report here are from annual national surveys 
conducted by the National Center for Health Statistics (NCHS), the US 
Department of Health & Human Services, and are self-reported by
respondents to the National Health Interview Survey (Figure A5.3).
Asthma prevalence in the US population increased from 4.2% in 1990 to
8% in 2006 and has since been relatively stable. Asthma prevalence in the
US population was 7.9% in 2017.


224 Young (in progress).

We accessed US annual national air quality concentration averages
over the same period (1990-2017) from the US Environmental Protection
Agency (Figure A5.4). As shown in Figure A5.4, all air quality
components of interest to the EPA declined (range 22-88%) over the 27-year
period between 1990 and 2017.

In particular, US ambient NO2 concentrations declined by 50% and
PM2.5 concentrations by 40% (2000-2017), whereas prevalence of asthma
in the US (population-weighted) increased by 88% during the same
period. These conflicting trends suggest that other factors, rather than
air quality components, may be more important in the development of
asthma later in life.
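The 88% figure for asthma follows from the prevalence values reported above (4.2% in 1990 and 7.9% in 2017). A one-line check, with an illustrative helper name:

```python
def percent_change(start, end):
    """Relative change from start to end, expressed in percent."""
    return 100.0 * (end - start) / start


print(round(percent_change(4.2, 7.9)))  # asthma prevalence, 1990 to 2017 → 88
```

Note that this is a relative change in prevalence; in absolute terms the increase is 3.7 percentage points.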


Counts


Counts and analysis search spaces in 19 base studies of the first
meta-analysis related to development of asthma are shown in Figure A5.5.

Figure A5.3: Annual prevalence of asthma in the US, population-weighted,
1990-2017 as reported by National Center for Health Statistics,
National Health Interview Surveys²²⁵

225 Young (in progress).

Figure A5.4: Change in US annual national air quality concentration
averages and annual prevalence of asthma in the US (population-weighted)
over the 27-year period 1990-2017 (except as noted)²²⁶


Parameter                                                    Change
Nitrogen Dioxide (NO2) 1-Hour                                -50%
Nitrogen Dioxide (NO2) Annual                                -56%
Particulate Matter 2.5 microns (PM2.5) 24-Hour (2000-2017)   -40%
Particulate Matter 2.5 microns (PM2.5) Annual (2000-2017)    -41%
Particulate Matter 10 microns (PM10) 24-Hour (2010-2017)     -34%
Carbon Monoxide (CO) 8-Hour                                  -71%
Ozone (O3) 8-Hour                                            -22%
Sulfur Dioxide (SO2) 1-Hour                                  -88%
Prevalence of asthma                                         +88%


Figure A5.5: Counts and analysis search spaces in 19 base studies,
associations between air quality components and development of asthma²²⁷


Study cohort        Outcomes, Predictors, Lags  Covariates  Questions  Models  Search space
BAMSE               7, 3, 4                     6           84         64      5,376
British Columbia    8, 4                        7           32         128     4,096
CHS                 2, 8                        15          16         32,768  524,288
CHS                 6, 5                        10          30         1,024   30,720
CHS 2003            5, [illegible]              15          15         32,768  491,520
CHIBA               3, 1, 3                     6           9          64      576
CHIBA               3, 6                        6           18         64      1,152
CHIBA               5, 4, 4                     8           80         256     20,480
ECHRS               1, 1, 6                     11          6          2,048   12,288
GINIplus+LISAplus   4, 4, 6                     12          96         4,096   393,216

226 Young (in progress).
227 Anderson (2013); Young (in progress).

Counts and analysis search spaces in 34 base studies of the second
meta-analysis related to asthma attack are shown in Figure A5.6.


Figure A5.6: Counts and analysis search spaces in 34 base studies,
associations between air quality components and asthma attack²²⁸